Preface


This  is  a  book  for  people  interested  in  solving  optimization  problems.  Because  of  the  wide 
(and  growing)  use  of  optimization  in  science,  engineering,  economics,  and  industry,  it  is 
essential  for  students  and  practitioners  alike  to  develop  an  understanding  of  optimization 
algorithms.  Knowledge  of  the  capabilities  and  limitations  of  these  algorithms  leads  to  a  better 
understanding  of  their  impact  on  various  applications,  and  points  the  way  to  future  research 
on  improving  and  extending  optimization  algorithms  and  software.  Our  goal  in  this  book 
is to give a comprehensive description of the most powerful, state-of-the-art techniques
for  solving  continuous  optimization  problems.  By  presenting  the  motivating  ideas  for  each 
algorithm,  we  try  to  stimulate  the  reader’s  intuition  and  make  the  technical  details  easier  to 
follow.  Formal  mathematical  requirements  are  kept  to  a  minimum. 

Because of our focus on continuous problems, we have omitted discussion of important
optimization topics such as discrete and stochastic optimization. However, there are a
great  many  applications  that  can  be  formulated  as  continuous  optimization  problems;  for 
instance, 

- finding the optimal trajectory for an aircraft or a robot arm;

- identifying the seismic properties of a piece of the earth’s crust by fitting a model of the region under study to a set of readings from a network of recording stations;

- designing a portfolio of investments to maximize expected return while maintaining an acceptable level of risk;

- controlling a chemical process or a mechanical device to optimize performance or meet standards of robustness;

- computing the optimal shape of an automobile or aircraft component.

Every year optimization algorithms are being called on to handle problems that
are much larger and more complex than in the past. Accordingly, the book emphasizes
large-scale optimization techniques, such as interior-point methods, inexact Newton methods,
limited-memory  methods,  and  the  role  of  partially  separable  functions  and  automatic 
differentiation.  It  treats  important  topics  such  as  trust-region  methods  and  sequential 
quadratic  programming  more  thoroughly  than  existing  texts,  and  includes  comprehensive 
discussion  of  such  “core  curriculum”  topics  as  constrained  optimization  theory,  Newton 
and  quasi-Newton  methods,  nonlinear  least  squares  and  nonlinear  equations,  the  simplex 
method,  and  penalty  and  barrier  methods  for  nonlinear  programming. 

The  Audience 

We intend that this book will be used in graduate-level courses in optimization, as offered
in engineering, operations research, computer science, and mathematics departments.
There  is  enough  material  here  for  a  two-semester  (or  three-quarter)  sequence  of  courses. 
We  hope,  too,  that  this  book  will  be  used  by  practitioners  in  engineering,  basic  science,  and 
industry,  and  our  presentation  style  is  intended  to  facilitate  self-study.  Since  the  book  treats 
a  number  of  new  algorithms  and  ideas  that  have  not  been  described  in  earlier  textbooks,  we 
hope  that  this  book  will  also  be  a  useful  reference  for  optimization  researchers. 

Prerequisites for this book include some knowledge of linear algebra (including numerical
linear algebra) and the standard sequence of calculus courses. To make the book as
self-contained as possible, we have summarized much of the relevant material from these areas
in the Appendix. Our experience in teaching engineering students has shown us that the
material  is  best  assimilated  when  combined  with  computer  programming  projects  in  which 
the  student  gains  a  good  feeling  for  the  algorithms — their  complexity,  memory  demands, 
and  elegance — and  for  the  applications.  In  most  chapters  we  provide  simple  computer 
exercises  that  require  only  minimal  programming  proficiency. 

Emphasis  and  Writing  Style 

We  have  used  a  conversational  style  to  motivate  the  ideas  and  present  the  numerical 
algorithms.  Rather  than  being  as  concise  as  possible,  our  aim  is  to  make  the  discussion  flow 
in  a  natural  way.  As  a  result,  the  book  is  comparatively  long,  but  we  believe  that  it  can  be 
read  relatively  rapidly.  The  instructor  can  assign  substantial  reading  assignments  from  the 
text  and  focus  in  class  only  on  the  main  ideas. 

A  typical  chapter  begins  with  a  nonrigorous  discussion  of  the  topic  at  hand,  including 
figures and diagrams and excluding technical details as far as possible. In subsequent sections,
the algorithms are motivated and discussed, and then stated explicitly. The major theoretical
results  are  stated,  and  in  many  cases  proved,  in  a  rigorous  fashion.  These  proofs  can  be 
skipped  by  readers  who  wish  to  avoid  technical  details. 

The  practice  of  optimization  depends  not  only  on  efficient  and  robust  algorithms, 
but  also  on  good  modeling  techniques,  careful  interpretation  of  results,  and  user-friendly 
software.  In  this  book  we  discuss  the  various  aspects  of  the  optimization  process — modeling, 
optimality  conditions,  algorithms,  implementation,  and  interpretation  of  results — but  not 
with equal weight. Examples throughout the book show how practical problems are formulated
as optimization problems, but our treatment of modeling is light and serves mainly
to set the stage for algorithmic developments. We refer the reader to Dantzig [86] and
Fourer, Gay, and Kernighan [112] for more comprehensive discussion of this issue. Our
treatment of optimality conditions is thorough but not exhaustive; some concepts are discussed
more extensively in Mangasarian [198] and Clarke [62]. As mentioned above, we are
quite  comprehensive  in  discussing  optimization  algorithms. 

Topics  Not  Covered 

We  omit  some  important  topics,  such  as  network  optimization,  integer  programming, 
stochastic  programming,  nonsmooth  optimization,  and  global  optimization.  Network  and 
integer  optimization  are  described  in  some  excellent  texts:  for  instance,  Ahuja,  Magnanti,  and 
Orlin [1] in the case of network optimization and Nemhauser and Wolsey [224], Papadimitriou
and Steiglitz [235], and Wolsey [312] in the case of integer programming. Books on
stochastic optimization are only now appearing; we mention those of Kall and Wallace [174]
and Birge and Louveaux [22]. Nonsmooth optimization comes in many flavors. The relatively
simple structures that arise in robust data fitting (which is sometimes based on the $\ell_1$ norm)
are treated by Osborne [232] and Fletcher [101]. The latter book also discusses algorithms
for nonsmooth penalty functions that arise in constrained optimization; we discuss these
briefly, too, in Chapter 18. A more analytical treatment of nonsmooth optimization is given
by Hiriart-Urruty and Lemaréchal [170]. We omit detailed treatment of some important
topics  that  are  the  focus  of  intense  current  research,  including  interior-point  methods  for 
nonlinear  programming  and  algorithms  for  complementarity  problems. 

Additional  Resource 

The  material  in  the  book  is  complemented  by  an  online  resource  called  the  NEOS 
Guide,  which  can  be  found  on  the  World-Wide  Web  at 

http://www.mcs.anl.gov/otc/Guide/

The  Guide  contains  information  about  most  areas  of  optimization,  and  presents  a  number 
of  case  studies  that  describe  applications  of  various  optimization  algorithms  to  real-world 
problems  such  as  portfolio  optimization  and  optimal  dieting.  Some  of  this  material  is 
interactive in nature and has been used extensively for class exercises.

For the most part, we have omitted detailed discussions of specific software packages,
and refer the reader to Moré and Wright [217] or to the Software Guide section of the NEOS
Guide,  which  can  be  found  at 

http://www.mcs.anl.gov/otc/Guide/SoftwareGuide/

Users  of  optimization  software  refer  in  great  numbers  to  this  web  site,  which  is  being 
constantly updated to reflect new packages and changes to existing software.

Preface to the Second Edition

During the six years since the first edition of this book appeared, the field of continuous
optimization has continued to grow and evolve. This new edition reflects a better understanding
of constrained optimization at both the algorithmic and theoretical levels, and of
the  demands  imposed  by  practical  applications.  Perhaps  most  notably,  new  chapters  have 
been added on two important topics: derivative-free optimization (Chapter 9) and interior-point
methods for nonlinear programming (Chapter 19). The former topic has proved to
be  of  great  interest  in  applications,  while  the  latter  topic  has  come  into  its  own  in  recent 
years  and  now  forms  the  basis  of  successful  codes  for  nonlinear  programming. 

Apart  from  the  new  chapters,  we  have  revised  and  updated  throughout  the  book, 
de-emphasizing  or  omitting  less  important  topics,  enhancing  the  treatment  of  subjects  of 
evident interest, and adding new material in many places. The first part (unconstrained
optimization) has been comprehensively reorganized to improve clarity. Discussion of Newton’s
method (the touchstone method for unconstrained problems) is distributed more naturally
throughout this part rather than being isolated in a single chapter. An expanded
discussion  of  large-scale  problems  appears  in  Chapter  7. 

Some reorganization has taken place also in the second part (constrained optimization),
with material common to sequential quadratic programming and interior-point
methods now appearing in the chapter on fundamentals of nonlinear programming
algorithms (Chapter 15) and the discussion of primal barrier methods moved to the new
interior-point chapter. There is much new material in this part, including a treatment of
nonlinear programming duality, an expanded discussion of algorithms for inequality constrained
quadratic programming, a discussion of dual simplex and presolving in linear
programming, a summary of practical issues in the implementation of interior-point linear
programming algorithms, a description of conjugate-gradient methods for quadratic programming,
and a discussion of filter methods and nonsmooth penalty methods in nonlinear
programming algorithms.

In  many  chapters  we  have  added  a  Perspectives  and  Software  section  near  the  end,  to 
place  the  preceding  discussion  in  context  and  discuss  the  state  of  the  art  in  software.  The 
appendix  has  been  rearranged  with  some  additional  topics  added,  so  that  it  can  be  used 
in  a  more  stand-alone  fashion  to  cover  some  of  the  mathematical  background  required 
for  the  rest  of  the  book.  The  exercises  have  been  revised  in  most  chapters.  After  these 
many  additions,  deletions,  and  changes,  the  second  edition  is  only  slightly  longer  than  the 
first,  reflecting  our  belief  that  careful  selection  of  the  material  to  include  and  exclude  is  an 
important  responsibility  for  authors  of  books  of  this  type. 

A  manual  containing  solutions  for  selected  problems  will  be  available  to  bona  fide 
instructors  through  the  publisher.  A  list  of  typos  will  be  maintained  on  the  book’s  web  site, 
which  is  accessible  from  the  web  pages  of  both  authors. 

We  acknowledge  with  gratitude  the  comments  and  suggestions  of  many  readers  of  the 
first  edition,  who  sent  corrections  to  many  errors  and  provided  valuable  perspectives  on  the 
material,  which  led  often  to  substantial  changes.  We  mention  in  particular  Frank  Curtis, 
Michael  Ferris,  Andreas  Griewank,  Jacek  Gondzio,  Sven  Leyffer,  Philip  Loewen,  Rembert 
Reemtsen, and David Stewart.

We also thank those who read various chapters of the new edition carefully during development, including Richard
Byrd, Nick Gould, Paul Hovland, Gabo Lopez-Calva, Long Hei, Katya Scheinberg, Andreas
Wächter, and Richard Waltz. We thank Jill Wright for improving some of the figures and for
the new cover graphic.

We  mentioned  in  the  original  preface  several  areas  of  optimization  that  are  not 
covered  in  this  book.  During  the  past  six  years,  this  list  has  only  grown  longer,  as  the  field 
has  continued  to  expand  in  new  directions.  In  this  regard,  the  following  areas  are  particularly 
noteworthy:  optimization  problems  with  complementarity  constraints,  second-order  cone 
and  semidefinite  programming,  simulation-based  optimization,  robust  optimization,  and 
mixed-integer  nonlinear  programming.  All  these  areas  have  seen  theoretical  and  algorithmic 
advances  in  recent  years,  and  in  many  cases  developments  are  being  driven  by  new  classes 
of  applications.  Although  this  book  does  not  cover  any  of  these  areas  directly,  it  provides  a 
foundation  from  which  they  can  be  studied. 


Jorge  Nocedal  Stephen  J.  Wright 
Evanston,  IL  Madison,  WI 


Chapter 1


Introduction 


People  optimize.  Investors  seek  to  create  portfolios  that  avoid  excessive  risk  while  achieving  a 
high  rate  of  return.  Manufacturers  aim  for  maximum  efficiency  in  the  design  and  operation 
of  their  production  processes.  Engineers  adjust  parameters  to  optimize  the  performance  of 
their  designs. 

Nature  optimizes.  Physical  systems  tend  to  a  state  of  minimum  energy.  The  molecules 
in  an  isolated  chemical  system  react  with  each  other  until  the  total  potential  energy  of  their 
electrons is minimized. Rays of light follow paths that minimize their travel time.

Optimization is an important tool in decision science and in the analysis of physical
systems.  To  make  use  of  this  tool,  we  must  first  identify  some  objective,  a  quantitative  measure 
of  the  performance  of  the  system  under  study.  This  objective  could  be  profit,  time,  potential 
energy,  or  any  quantity  or  combination  of  quantities  that  can  be  represented  by  a  single 
number.  The  objective  depends  on  certain  characteristics  of  the  system,  called  variables  or 
unknowns.  Our  goal  is  to  find  values  of  the  variables  that  optimize  the  objective.  Often  the 
variables  are  restricted,  or  constrained,  in  some  way.  For  instance,  quantities  such  as  electron 
density  in  a  molecule  and  the  interest  rate  on  a  loan  cannot  be  negative. 

The  process  of  identifying  objective,  variables,  and  constraints  for  a  given  problem  is 
known  as  modeling.  Construction  of  an  appropriate  model  is  the  first  step — sometimes  the 
most  important  step — in  the  optimization  process.  If  the  model  is  too  simplistic,  it  will  not 
give  useful  insights  into  the  practical  problem.  If  it  is  too  complex,  it  may  be  too  difficult  to 
solve. 

Once  the  model  has  been  formulated,  an  optimization  algorithm  can  be  used  to 
find  its  solution,  usually  with  the  help  of  a  computer.  There  is  no  universal  optimization 
algorithm  but  rather  a  collection  of  algorithms,  each  of  which  is  tailored  to  a  particular  type 
of  optimization  problem.  The  responsibility  of  choosing  the  algorithm  that  is  appropriate 
for  a  specific  application  often  falls  on  the  user.  This  choice  is  an  important  one,  as  it  may 
determine  whether  the  problem  is  solved  rapidly  or  slowly  and,  indeed,  whether  the  solution 
is  found  at  all. 

After  an  optimization  algorithm  has  been  applied  to  the  model,  we  must  be  able  to 
recognize  whether  it  has  succeeded  in  its  task  of  finding  a  solution.  In  many  cases,  there 
are  elegant  mathematical  expressions  known  as  optimality  conditions  for  checking  that  the 
current  set  of  variables  is  indeed  the  solution  of  the  problem.  If  the  optimality  conditions  are 
not  satisfied,  they  may  give  useful  information  on  how  the  current  estimate  of  the  solution 
can  be  improved.  The  model  may  be  improved  by  applying  techniques  such  as  sensitivity 
analysis,  which  reveals  the  sensitivity  of  the  solution  to  changes  in  the  model  and  data. 
Interpretation  of  the  solution  in  terms  of  the  application  may  also  suggest  ways  in  which  the 
model  can  be  refined  or  improved  (or  corrected).  If  any  changes  are  made  to  the  model,  the 
optimization  problem  is  solved  anew,  and  the  process  repeats. 

MATHEMATICAL  FORMULATION 

Mathematically  speaking,  optimization  is  the  minimization  or  maximization  of  a 
function  subject  to  constraints  on  its  variables.  We  use  the  following  notation: 

- x is the vector of variables, also called unknowns or parameters;

- f is the objective function, a (scalar) function of x that we want to maximize or minimize;

- $c_i$ are constraint functions, which are scalar functions of x that define certain equations and inequalities that the unknown vector x must satisfy.

Using this notation, the optimization problem can be written as follows:

$$
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad
\begin{cases}
c_i(x) = 0, & i \in \mathcal{E}, \\
c_i(x) \ge 0, & i \in \mathcal{I}.
\end{cases}
\tag{1.1}
$$

Here $\mathcal{E}$ and $\mathcal{I}$ are sets of indices for equality and inequality constraints, respectively.
As a simple example, consider the problem

$$
\min \; (x_1 - 2)^2 + (x_2 - 1)^2 \quad \text{subject to} \quad
\begin{cases}
x_1^2 - x_2 \le 0, \\
x_1 + x_2 \le 2.
\end{cases}
\tag{1.2}
$$

We can write this problem in the form (1.1) by defining

$$
f(x) = (x_1 - 2)^2 + (x_2 - 1)^2,
$$

$$
c(x) = \begin{bmatrix} c_1(x) \\ c_2(x) \end{bmatrix}
     = \begin{bmatrix} -x_1^2 + x_2 \\ -x_1 - x_2 + 2 \end{bmatrix}, \qquad
\mathcal{I} = \{1, 2\}, \quad \mathcal{E} = \emptyset.
$$


Figure 1.1 shows the contours of the objective function, that is, the set of points for which
f(x) has a constant value. It also illustrates the feasible region, which is the set of points
satisfying all the constraints (the area between the two constraint boundaries), and the point
$x^*$, which is the solution of the problem. Note that the “infeasible side” of the inequality
constraints is shaded.
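
As a rough numerical check of example (1.2), the following sketch evaluates the objective and the constraint functions in the form (1.1) and scans a grid for the best feasible point. The grid bounds, the spacing, and the function names are illustrative assumptions rather than part of the text, and a grid scan is used only because it is easy to verify, not because it is a practical algorithm.

```python
# Brute-force check of example (1.2), written in the standard form (1.1).
# Grid bounds and spacing are arbitrary choices made for illustration.

def f(x1, x2):
    """Objective of (1.2)."""
    return (x1 - 2.0) ** 2 + (x2 - 1.0) ** 2

def c(x1, x2):
    """Inequality constraints, each required to satisfy c_i(x) >= 0."""
    return (-x1 ** 2 + x2, -x1 - x2 + 2.0)

def is_feasible(x1, x2, tol=1e-12):
    """True when both constraints hold up to a small rounding tolerance."""
    return all(ci >= -tol for ci in c(x1, x2))

def grid_search(lo=-2.0, hi=2.0, steps=400):
    """Return the feasible grid point with the smallest objective value."""
    best_point, best_value = None, float("inf")
    for i in range(steps + 1):
        x1 = lo + (hi - lo) * i / steps
        for j in range(steps + 1):
            x2 = lo + (hi - lo) * j / steps
            if is_feasible(x1, x2) and f(x1, x2) < best_value:
                best_point, best_value = (x1, x2), f(x1, x2)
    return best_point, best_value

x_star, f_star = grid_search()
print(x_star, f_star)
```

The scan should land on the point (1, 1), where both constraints hold with equality and the objective value is 1. The unconstrained minimizer (2, 1) violates both constraints, which is why the solution sits on the boundary of the feasible region.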

The  example  above  illustrates,  too,  that  transformations  are  often  necessary  to  express 
an  optimization  problem  in  the  particular  form  (1.1).  Often  it  is  more  natural  or  convenient 
to  label  the  unknowns  with  two  or  three  subscripts,  or  to  refer  to  different  variables  by 
completely  different  names,  so  that  relabeling  is  necessary  to  pose  the  problem  in  the  form 
(1.1). Another common difference is that we are required to maximize rather than minimize
f, but we can accommodate this change easily by minimizing $-f$ in the formulation (1.1).
Good  modeling  systems  perform  the  conversion  to  standardized  formulations  such  as  (1.1) 
transparently  to  the  user. 
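
The kind of conversion described above can be sketched in a few lines. The helper names `negate` and `leq_to_geq` and the sample problem are hypothetical, introduced only for illustration: maximizing f is replaced by minimizing -f, and a constraint of the form g(x) <= b is rewritten as c(x) = b - g(x) >= 0 to match (1.1).

```python
# Sketch: recasting a problem into the form (1.1).
# All names and the sample data are illustrative assumptions.

def negate(f):
    """Turn 'maximize f' into 'minimize -f'."""
    return lambda x: -f(x)

def leq_to_geq(g, b):
    """Rewrite g(x) <= b as c(x) = b - g(x) >= 0."""
    return lambda x: b - g(x)

# Hypothetical problem: maximize 3x - x^2 subject to x <= 1.
profit = lambda x: 3.0 * x - x ** 2
objective = negate(profit)                   # minimize x^2 - 3x
constraint = leq_to_geq(lambda x: x, 1.0)    # 1 - x >= 0

# Compare both formulations on a coarse set of candidate points.
candidates = [k / 100.0 for k in range(-200, 201)]
feasible = [x for x in candidates if constraint(x) >= 0.0]
x_min = min(feasible, key=objective)
x_max = max(feasible, key=profit)
print(x_min, x_max)
```

Both formulations select the same point, since the minimizer of -f over the feasible candidates is the maximizer of f over the same set.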


EXAMPLE:  A  TRANSPORTATION  PROBLEM 

We begin with a much simplified example of a problem that might arise in manufacturing
and transportation. A chemical company has 2 factories $F_1$ and $F_2$ and a dozen retail
outlets $R_1, R_2, \ldots, R_{12}$. Each factory $F_i$ can produce $a_i$ tons of a certain chemical product
each week; $a_i$ is called the capacity of the plant. Each retail outlet $R_j$ has a known weekly
demand of $b_j$ tons of the product. The cost of shipping one ton of the product from factory
$F_i$ to retail outlet $R_j$ is $c_{ij}$.

The problem is to determine how much of the product to ship from each factory
to each outlet so as to satisfy all the requirements and minimize cost. The variables of the
problem are $x_{ij}$, $i = 1, 2$, $j = 1, \ldots, 12$, where $x_{ij}$ is the number of tons of the product
shipped from factory $F_i$ to retail outlet $R_j$; see Figure 1.2. We can write the problem as


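$$
\begin{aligned}
\min \quad & \sum_{ij} c_{ij} x_{ij} && \text{(1.3a)} \\
\text{subject to} \quad & \sum_{j=1}^{12} x_{ij} \le a_i, \quad i = 1, 2, && \text{(1.3b)} \\
& \sum_{i=1}^{2} x_{ij} \ge b_j, \quad j = 1, \ldots, 12, && \text{(1.3c)} \\
& x_{ij} \ge 0, \quad i = 1, 2, \quad j = 1, \ldots, 12. && \text{(1.3d)}
\end{aligned}
$$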

This  type  of  problem  is  known  as  a  linear  programming  problem,  since  the  objective  function 
and  the  constraints  are  all  linear  functions.  In  a  more  practical  model,  we  would  also  include 
costs associated with manufacturing and storing the product. There may be volume discounts
in practice for shipping the product; for example, the cost (1.3a) could be represented by
$\sum_{ij} c_{ij} \sqrt{\delta + x_{ij}}$, where $\delta > 0$ is a small subscription fee. In this case, the problem is a
nonlinear program because the objective function is nonlinear.

Figure 1.2 A transportation problem.
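
To make the transportation model concrete, here is a minimal sketch for a tiny invented instance with 2 factories and only 3 outlets. The capacities, demands, and costs are made up for illustration, and the sketch restricts shipments to whole tons so that every plan can be enumerated; that restriction is an extra assumption, and a linear programming method would be used in practice.

```python
# Brute-force solution of a tiny instance of the transportation problem (1.3).
# The data below are invented for illustration: 2 factories, 3 outlets.
from itertools import product

a = [5, 5]                 # weekly capacities a_i (tons)
b = [3, 2, 4]              # weekly demands b_j (tons)
cost = [[4, 6, 9],         # cost[i][j]: cost per ton from factory i to outlet j
        [5, 3, 7]]

def total_cost(x):
    """Objective (1.3a)."""
    return sum(cost[i][j] * x[i][j] for i in range(2) for j in range(3))

def is_feasible(x):
    """Constraints (1.3b), (1.3c), and (1.3d)."""
    within_capacity = all(sum(x[i]) <= a[i] for i in range(2))
    demand_met = all(x[0][j] + x[1][j] >= b[j] for j in range(3))
    nonnegative = all(x[i][j] >= 0 for i in range(2) for j in range(3))
    return within_capacity and demand_met and nonnegative

def solve_by_enumeration():
    """Enumerate whole-ton plans that meet each demand exactly.

    Shipping more than the demand only adds cost in this instance, so
    plans with x[0][j] + x[1][j] == b[j] are enough to consider.
    """
    best_plan, best_cost = None, float("inf")
    for from_f1 in product(*(range(bj + 1) for bj in b)):
        plan = [list(from_f1), [bj - s for bj, s in zip(b, from_f1)]]
        if is_feasible(plan) and total_cost(plan) < best_cost:
            best_plan, best_cost = plan, total_cost(plan)
    return best_plan, best_cost

plan, value = solve_by_enumeration()
print(plan, value)
```

In this instance the second factory is the cheaper source for outlets 2 and 3 but cannot cover both demands within its capacity, so one ton for outlet 3 is shifted to the first factory. This shows how the capacity constraint (1.3b) shapes the solution.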


CONTINUOUS  VERSUS  DISCRETE  OPTIMIZATION 

In  some  optimization  problems  the  variables  make  sense  only  if  they  take  on  integer 
values. For example, a variable $x_i$ could represent the number of power plants of type $i$
that should be constructed by an electricity provider during the next 5 years, or it could
indicate whether or not a particular factory should be located in a particular city. The
mathematical formulation of such problems includes integrality constraints, which have
the form $x_i \in \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers, or binary constraints, which have the form
$x_i \in \{0, 1\}$, in addition to algebraic constraints like those appearing in (1.1). Problems of
this  type  are  called  integer  programming  problems.  If  some  of  the  variables  in  the  problem 
are  not  restricted  to  be  integer  or  binary  variables,  they  are  sometimes  called  mixed  integer 
programming  problems,  or  MIPs  for  short. 

Integer  programming  problems  are  a  type  of  discrete  optimization  problem.  Generally, 
discrete  optimization  problems  may  contain  not  only  integers  and  binary  variables,  but  also 
more  abstract  variable  objects  such  as  permutations  of  an  ordered  set.  The  defining  feature 
of a discrete optimization problem is that the unknown x is drawn from a finite (but often
very  large)  set.  By  contrast,  the  feasible  set  for  continuous  optimization  problems — the  class 
of  problems  studied  in  this  book — is  usually  uncountably  infinite,  as  when  the  components 
of  x  are  allowed  to  be  real  numbers.  Continuous  optimization  problems  are  normally  easier 
to  solve  because  the  smoothness  of  the  functions  makes  it  possible  to  use  objective  and 
constraint  information  at  a  particular  point  x  to  deduce  information  about  the  function’s 
behavior at all points close to x. In discrete problems, by contrast, the behavior of the
objective  and  constraints  may  change  significantly  as  we  move  from  one  feasible  point  to 
another,  even  if  the  two  points  are  “close”  by  some  measure.  The  feasible  sets  for  discrete 
optimization  problems  can  be  thought  of  as  exhibiting  an  extreme  form  of  nonconvexity,  as 
a  convex  combination  of  two  feasible  points  is  in  general  not  feasible. 
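
The remark about nonconvexity can be illustrated with a small sketch. The feasible set used here, the binary points in two dimensions, and the helper names are assumptions chosen for illustration.

```python
# Sketch: a convex combination of two feasible binary points need not be feasible.
# The feasible set {0, 1}^2 is a hypothetical example.

def is_binary(x, tol=1e-12):
    """Check the binary constraints x_i in {0, 1}."""
    return all(abs(xi) <= tol or abs(xi - 1.0) <= tol for xi in x)

def convex_combination(x, y, alpha):
    """Return alpha * x + (1 - alpha) * y componentwise."""
    return [alpha * xi + (1.0 - alpha) * yi for xi, yi in zip(x, y)]

x = [0.0, 1.0]
y = [1.0, 0.0]
midpoint = convex_combination(x, y, 0.5)
print(is_binary(x), is_binary(y), is_binary(midpoint))
```

Both endpoints satisfy the binary constraints, while their midpoint (0.5, 0.5) does not.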

Discrete optimization problems are not addressed directly in this book; we refer the
reader to the texts by Papadimitriou and Steiglitz [235], Nemhauser and Wolsey [224], Cook
et al. [77], and Wolsey [312] for comprehensive treatments of this subject. We note, however,
that  continuous  optimization  techniques  often  play  an  important  role  in  solving  discrete 
optimization  problems.  For  instance,  the  branch-and-bound  method  for  integer  linear 
programming  problems  requires  the  repeated  solution  of  linear  programming  “relaxations,” 
in  which  some  of  the  integer  variables  are  fixed  at  integer  values,  while  for  other  integer 
variables  the  integrality  constraints  are  temporarily  ignored.  These  subproblems  are  usually 
solved  by  the  simplex  method,  which  is  discussed  in  Chapter  13  of  this  book. 

CONSTRAINED  AND  UNCONSTRAINED  OPTIMIZATION 

Problems  with  the  general  form  (1.1)  can  be  classified  according  to  the  nature  of  the 
objective function and constraints (linear, nonlinear, convex), the number of variables (large
or  small),  the  smoothness  of  the  functions  (differentiable  or  nondifferentiable),  and  so  on. 
An  important  distinction  is  between  problems  that  have  constraints  on  the  variables  and 
those  that  do  not.  This  book  is  divided  into  two  parts  according  to  this  classification. 

Unconstrained optimization problems, for which we have $\mathcal{E} = \mathcal{I} = \emptyset$ in (1.1), arise
directly in many practical applications. Even for some problems with natural constraints
on the variables, it may be safe to disregard them as they do not affect the solution and
do not interfere with algorithms. Unconstrained problems arise also as reformulations of
constrained optimization problems, in which the constraints are replaced by penalization
terms added to the objective function that have the effect of discouraging constraint violations.
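
A minimal sketch of the penalization idea follows, assuming a quadratic penalty (one of several possible choices) and a hypothetical one-variable problem: minimize (x - 2)^2 subject to 1 - x >= 0. The function names, the search interval, and the penalty weights are illustrative assumptions.

```python
# Sketch: quadratic penalty for  min (x - 2)^2  subject to  1 - x >= 0.
# The problem, the penalty form, and the search interval are illustrative choices.

def penalized(x, mu):
    """Objective plus (mu / 2) times the squared constraint violation."""
    violation = max(0.0, x - 1.0)      # amount by which 1 - x >= 0 is violated
    return (x - 2.0) ** 2 + 0.5 * mu * violation ** 2

def minimize_1d(func, lo, hi, iterations=200):
    """Ternary search; valid here because the penalized function is convex."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def penalty_minimizer(mu):
    """Unconstrained minimizer of the penalized objective for weight mu."""
    return minimize_1d(lambda x: penalized(x, mu), 0.0, 3.0)

for mu in (1.0, 10.0, 100.0, 1000.0):
    print(mu, penalty_minimizer(mu))
```

For this problem the unconstrained minimizer of the penalized function works out to (4 + mu) / (2 + mu). It violates the constraint slightly for every finite mu and approaches the constrained solution x = 1 as mu grows.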

Constrained  optimization  problems  arise  from  models  in  which  constraints  play  an 
essential  role,  for  example  in  imposing  budgetary  constraints  in  an  economic  problem  or 
shape constraints in a design problem. These constraints may be simple bounds such as
$0 \le x_1 \le 100$, more general linear constraints such as $\sum_i x_i \le 1$, or nonlinear inequalities
that represent complex relationships among the variables.

When  the  objective  function  and  all  the  constraints  are  linear  functions  of  x,  the 
problem  is  a  linear  programming  problem.  Problems  of  this  type  are  probably  the  most 
widely  formulated  and  solved  of  all  optimization  problems,  particularly  in  management, 
financial,  and  economic  applications.  Nonlinear  programming  problems,  in  which  at  least 
some  of  the  constraints  or  the  objective  are  nonlinear  functions,  tend  to  arise  naturally  in 
the  physical  sciences  and  engineering,  and  are  becoming  more  widely  used  in  management 
and  economic  sciences  as  well. 

GLOBAL  AND  LOCAL  OPTIMIZATION 

Many  algorithms  for  nonlinear  optimization  problems  seek  only  a  local  solution,  a 
point  at  which  the  objective  function  is  smaller  than  at  all  other  feasible  nearby  points.  They 
do  not  always  find  the  global  solution,  which  is  the  point  with  lowest  function  value  among  all 
feasible points. Global solutions are needed in some applications, but for many problems they
are difficult to recognize and even more difficult to locate. For convex programming problems,
and  more  particularly  for  linear  programs,  local  solutions  are  also  global  solutions.  General 
nonlinear  problems,  both  constrained  and  unconstrained,  may  possess  local  solutions  that 
are  not  global  solutions. 
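
The distinction between local and global solutions can be seen on a small nonconvex example. The function, the step size, and the starting points below are assumptions chosen for illustration, and plain gradient descent stands in for the more refined methods treated in this book.

```python
# Sketch: gradient descent on a nonconvex function reaches different local
# solutions from different starting points. Function and step size are
# illustrative assumptions.

def f(x):
    return x ** 4 - 3.0 * x ** 2 + x

def fprime(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def gradient_descent(x0, step=0.01, iterations=5000):
    """Fixed-step gradient descent starting from x0."""
    x = x0
    for _ in range(iterations):
        x -= step * fprime(x)
    return x

x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs stop at points where the derivative vanishes, but only the run started from the left reaches the lower of the two function values. The run started from the right returns a local solution that is not global.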

In  this  book  we  treat  global  optimization  only  in  passing  and  focus  instead  on  the 
computation  and  characterization  of  local  solutions.  We  note,  however,  that  many  successful 
global  optimization  algorithms  require  the  solution  of  many  local  optimization  problems, 
to  which  the  algorithms  described  in  this  book  can  be  applied. 

Research  papers  on  global  optimization  can  be  found  in  Floudas  and  Pardalos  [109] 
and  in  the  Journal  of  Global  Optimization. 

STOCHASTIC  AND  DETERMINISTIC  OPTIMIZATION 

In  some  optimization  problems,  the  model  cannot  be  fully  specified  because  it  depends 
on  quantities  that  are  unknown  at  the  time  of  formulation.  This  characteristic  is  shared  by 
many  economic  and  financial  planning  models,  which  may  depend  for  example  on  future 
interest  rates,  future  demands  for  a  product,  or  future  commodity  prices,  but  uncertainty 
can  arise  naturally  in  almost  any  type  of  application. 

Rather  than  just  use  a  “best  guess”  for  the  uncertain  quantities,  modelers  may  obtain 
more  useful  solutions  by  incorporating  additional  knowledge  about  these  quantities  into 
the  model.  For  example,  they  may  know  a  number  of  possible  scenarios  for  the  uncertain 
demand,  along  with  estimates  of  the  probabilities  of  each  scenario.  Stochastic  optimization 
algorithms  use  these  quantifications  of  the  uncertainty  to  produce  solutions  that  optimize 
the  expected  performance  of  the  model. 
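
The scenario idea can be sketched with a small ordering problem. Every number below (prices, costs, demand scenarios, and probabilities) is invented for illustration, and simple enumeration stands in for a real stochastic optimization algorithm.

```python
# Sketch: choosing an order quantity under demand scenarios.
# Prices, costs, scenarios, and probabilities are invented for illustration.

scenarios = [(80, 0.3), (100, 0.5), (120, 0.2)]   # (demand, probability)
unit_cost = 6.0
unit_price = 10.0

def profit(quantity, demand):
    """Profit when `quantity` units are bought and at most `demand` are sold."""
    return unit_price * min(quantity, demand) - unit_cost * quantity

def expected_profit(quantity):
    """Probability-weighted profit over all scenarios."""
    return sum(p * profit(quantity, d) for d, p in scenarios)

# "Best guess" approach: plan for the mean demand only.
mean_demand = sum(p * d for d, p in scenarios)
guess_quantity = round(mean_demand)

# Scenario-based approach: maximize expected profit over candidate quantities.
best_quantity = max(range(0, 151), key=expected_profit)

print(guess_quantity, expected_profit(guess_quantity))
print(best_quantity, expected_profit(best_quantity))
```

With these numbers the mean-demand guess orders 98 units, while maximizing expected profit over the scenarios selects 100 units and earns slightly more on average.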

Related paradigms for dealing with uncertain data in the model include chance-constrained
optimization, in which we ensure that the variables x satisfy the given constraints
to  some  specified  probability,  and  robust  optimization,  in  which  certain  constraints  are 
required  to  hold  for  all  possible  values  of  the  uncertain  data. 

We  do  not  consider  stochastic  optimization  problems  further  in  this  book,  focusing 
instead  on  deterministic  optimization  problems,  in  which  the  model  is  completely  known. 
Many  algorithms  for  stochastic  optimization  do,  however,  proceed  by  formulating  one  or 
more  deterministic  subproblems,  each  of  which  can  be  solved  by  the  techniques  outlined 
here. 

Stochastic  and  robust  optimization  have  seen  a  great  deal  of  recent  research  activity. 
For  further  information  on  stochastic  optimization,  consult  the  books  of  Birge  and 
Louveaux [22] and Kall and Wallace [174]. Robust optimization is discussed in Ben-Tal
and  Nemirovski  [15]. 

CONVEXITY 

The  concept  of  convexity  is  fundamental  in  optimization.  Many  practical  problems 
possess this property, which generally makes them easier to solve both in theory and practice.

The term “convex” can be applied both to sets and to functions. A set $S \subseteq \mathbb{R}^n$ is a
convex set if the straight line segment connecting any two points in $S$ lies entirely inside $S$.
Formally, for any two points $x \in S$ and $y \in S$, we have $\alpha x + (1 - \alpha) y \in S$ for all $\alpha \in [0, 1]$.
The function $f$ is a convex function if its domain $S$ is a convex set and if for any two points
$x$ and $y$ in $S$, the following property is satisfied:

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \text{for all } \alpha \in [0, 1]. \tag{1.4}$$

Simple instances of convex sets include the unit ball $\{y \in \mathbb{R}^n \mid \|y\|_2 \le 1\}$ and any
polyhedron, which is a set defined by linear equalities and inequalities, that is,

$$\{x \in \mathbb{R}^n \mid Ax = b, \; Cx \le d\},$$

where $A$ and $C$ are matrices of appropriate dimension, and $b$ and $d$ are vectors. Simple
instances of convex functions include the linear function $f(x) = c^T x + \alpha$, for any constant
vector $c \in \mathbb{R}^n$ and scalar $\alpha$; and the convex quadratic function $f(x) = x^T H x$, where $H$ is
a symmetric positive semidefinite matrix.

We say that $f$ is strictly convex if the inequality in (1.4) is strict whenever $x \ne y$ and
$\alpha$ is in the open interval $(0, 1)$. A function $f$ is said to be concave if $-f$ is convex.
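Inequality (1.4) can be probed numerically. The following is a minimal sketch, not a proof of convexity: it samples random pairs of points and values of $\alpha$ and counts violations of (1.4) for quadratic forms $x^T H x$. The matrices `H_psd` and `H_indef`, the sampling box, and the tolerance are illustrative assumptions, not taken from the text.

```python
import random

def quad(H, x):
    """Evaluate the quadratic form x^T H x for a square matrix H (list of rows)."""
    n = len(x)
    return sum(x[i] * H[i][j] * x[j] for i in range(n) for j in range(n))

def violates_convexity(f, x, y, alpha, tol=1e-9):
    """Return True if inequality (1.4) fails at (x, y, alpha), up to a tolerance."""
    z = [alpha * xi + (1.0 - alpha) * yi for xi, yi in zip(x, y)]
    return f(z) > alpha * f(x) + (1.0 - alpha) * f(y) + tol

# Illustrative matrices (assumptions chosen for this sketch).
H_psd = [[2.0, 1.0], [1.0, 2.0]]     # eigenvalues 1 and 3, so x^T H x is convex
H_indef = [[1.0, 0.0], [0.0, -1.0]]  # eigenvalues 1 and -1, so x^T H x is not convex

def count_violations(H, trials=2000, seed=0):
    """Count sampled triples (x, y, alpha) at which (1.4) is violated."""
    rng = random.Random(seed)
    f = lambda v: quad(H, v)
    count = 0
    for _ in range(trials):
        x = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        y = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        alpha = rng.random()
        if violates_convexity(f, x, y, alpha):
            count += 1
    return count

violations_psd = count_violations(H_psd)
violations_indef = count_violations(H_indef)
print(violations_psd, violations_indef)
```

A sampled test of this kind can only refute convexity; finding no violations is consistent with convexity but does not establish it.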

If  the  objective  function  in  the  optimization  problem  (1.1)  and  the  feasible  region  are 
both  convex,  then  any  local  solution  of  the  problem  is  in  fact  a  global  solution. 

The  term  convex  programming  is  used  to  describe  a  special  case  of  the  general 
constrained  optimization  problem  (1.1)  in  which 

•  the  objective  function  is  convex, 

•  the equality constraint functions $c_i(\cdot)$, $i \in \mathcal{E}$, are linear, and

•  the inequality constraint functions $c_i(\cdot)$, $i \in \mathcal{I}$, are concave.


OPTIMIZATION  ALGORITHMS 

Optimization  algorithms  are  iterative.  They  begin  with  an  initial  guess  of  the  variable 
x  and  generate  a  sequence  of  improved  estimates  (called  “iterates”)  until  they  terminate, 
hopefully  at  a  solution.  The  strategy  used  to  move  from  one  iterate  to  the  next  distinguishes 
one  algorithm  from  another.  Most  strategies  make  use  of  the  values  of  the  objective  function 
$f$, the constraint functions $c_i$, and possibly the first and second derivatives of these functions.
Some  algorithms  accumulate  information  gathered  at  previous  iterations,  while  others  use 
only  local  information  obtained  at  the  current  point.  Regardless  of  these  specifics  (which 
will  receive  plenty  of  attention  in  the  rest  of  the  book),  good  algorithms  should  possess  the 
following  properties: 

•  Robustness.  They  should  perform  well  on  a  wide  variety  of  problems  in  their  class, 
for  all  reasonable  values  of  the  starting  point. 




•  Efficiency.  They  should  not  require  excessive  computer  time  or  storage. 

•  Accuracy.  They  should  be  able  to  identify  a  solution  with  precision,  without  being 
overly  sensitive  to  errors  in  the  data  or  to  the  arithmetic  rounding  errors  that  occur 
when  the  algorithm  is  implemented  on  a  computer. 

These goals may conflict. For example, a rapidly convergent method for a large unconstrained
nonlinear problem may require too much computer storage. On the other hand,
a  robust  method  may  also  be  the  slowest.  Tradeoffs  between  convergence  rate  and  storage 
requirements,  and  between  robustness  and  speed,  and  so  on,  are  central  issues  in  numerical 
optimization.  They  receive  careful  consideration  in  this  book. 

The  mathematical  theory  of  optimization  is  used  both  to  characterize  optimal  points 
and  to  provide  the  basis  for  most  algorithms.  It  is  not  possible  to  have  a  good  understanding 
of  numerical  optimization  without  a  firm  grasp  of  the  supporting  theory.  Accordingly, 
this  book  gives  a  solid  (though  not  comprehensive)  treatment  of  optimality  conditions,  as 
well  as  convergence  analysis  that  reveals  the  strengths  and  weaknesses  of  some  of  the  most 
important  algorithms. 

NOTES  AND  REFERENCES 

Optimization  traces  its  roots  to  the  calculus  of  variations  and  the  work  of  Euler  and 
Lagrange. The development of linear programming in the 1940s broadened the field and
stimulated  much  of  the  progress  in  modern  optimization  theory  and  practice  during  the 
past  60  years. 

Optimization  is  often  called  mathematical  programming,  a  somewhat  confusing  term 
coined  in  the  1940s,  before  the  word  “programming”  became  inextricably  linked  with 
computer  software.  The  original  meaning  of  this  word  (and  the  intended  one  in  this  context) 
was  more  inclusive,  with  connotations  of  algorithm  design  and  analysis. 

Modeling  will  not  be  treated  extensively  in  the  book.  It  is  an  essential  subject  in  its 
own  right,  as  it  makes  the  connection  between  optimization  algorithms  and  software  on 
the  one  hand,  and  applications  on  the  other  hand.  Information  about  modeling  techniques 
for various application areas can be found in Dantzig [86], Ahuja, Magnanti, and Orlin [1],
Fourer, Gay, and Kernighan [112], Winston [308], and Rardin [262].

In unconstrained optimization, we minimize an objective function that depends on real
variables, with no restrictions at all on the values of these variables. The mathematical
formulation is

$$\min_{x} f(x), \tag{2.1}$$

where $x \in \mathbb{R}^n$ is a real vector with $n \ge 1$ components and $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth
function.

Figure 2.1 Least squares data fitting problem.


Usually, we lack a global perspective on the function $f$. All we know are the values
of $f$ and maybe some of its derivatives at a set of points $x_0, x_1, x_2, \ldots$. Fortunately, our
algorithms get to choose these points, and they try to do so in a way that identifies a solution
reliably and without using too much computer time or storage. Often, the information
about $f$ does not come cheaply, so we usually prefer algorithms that do not call for this
information unnecessarily.


□  Example  2.1 

Suppose that we are trying to find a curve that fits some experimental data. Figure 2.1
plots measurements $y_1, y_2, \ldots, y_m$ of a signal taken at times $t_1, t_2, \ldots, t_m$. From the data and
our knowledge of the application, we deduce that the signal has exponential and oscillatory
behavior of certain types, and we choose to model it by the function

$$\phi(t; x) = x_1 + x_2 e^{-(x_3 - t)^2 / x_4} + x_5 \cos(x_6 t).$$

The real numbers $x_i$, $i = 1, 2, \ldots, 6$, are the parameters of the model; we would like to
choose them to make the model values $\phi(t_j; x)$ fit the observed data $y_j$ as closely as possible.
To state our objective as an optimization problem, we group the parameters $x_i$ into a vector
of unknowns $x = (x_1, x_2, \ldots, x_6)^T$, and define the residuals

$$r_j(x) = y_j - \phi(t_j; x), \quad j = 1, 2, \ldots, m, \tag{2.2}$$

which measure the discrepancy between the model and the observed data. Our estimate of
$x$ will be obtained by solving the problem

$$\min_{x \in \mathbb{R}^6} f(x) = r_1^2(x) + r_2^2(x) + \cdots + r_m^2(x). \tag{2.3}$$

This is a nonlinear least-squares problem, a special case of unconstrained optimization.
It illustrates that some objective functions can be expensive to evaluate even when the
number of variables is small. Here we have $n = 6$, but if the number of measurements
$m$ is large ($10^5$, say), evaluation of $f(x)$ for a given parameter vector $x$ is a significant
computation.
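The model, residuals (2.2), and objective (2.3) translate directly into code. The sketch below is illustrative only: the measurement times, the synthetic data, and the deterministic perturbation standing in for measurement noise are assumptions made here, since the data behind Figure 2.1 are not given. Each evaluation of `objective` costs one model evaluation per measurement, which is why a large $m$ makes $f$ expensive even though $n = 6$.

```python
import math

def phi(t, x):
    """Model phi(t; x) = x1 + x2*exp(-(x3 - t)^2 / x4) + x5*cos(x6*t)."""
    x1, x2, x3, x4, x5, x6 = x
    return x1 + x2 * math.exp(-(x3 - t) ** 2 / x4) + x5 * math.cos(x6 * t)

def residuals(x, t_data, y_data):
    """Residuals r_j(x) = y_j - phi(t_j; x), as in (2.2)."""
    return [y - phi(t, x) for t, y in zip(t_data, y_data)]

def objective(x, t_data, y_data):
    """Sum of squared residuals, as in (2.3); cost grows linearly with m."""
    return sum(r * r for r in residuals(x, t_data, y_data))

# Synthetic data (an assumption for illustration): sample the model at a
# chosen parameter vector, then perturb the samples deterministically.
x_true = (1.1, 0.01, 1.2, 1.5, 2.0, 1.5)
t_data = [0.1 * j for j in range(1, 51)]
y_exact = [phi(t, x_true) for t in t_data]
y_data = [y + 0.05 * math.sin(7.0 * j) for j, y in enumerate(y_exact)]

f_exact = objective(x_true, t_data, y_exact)  # data reproduced exactly
f_noisy = objective(x_true, t_data, y_data)   # perturbed data leave a nonzero misfit
print(f_exact, f_noisy)
```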


Suppose that for the data given in Figure 2.1 the optimal solution of (2.3) is approximately
$x^* = (1.1, 0.01, 1.2, 1.5, 2.0, 1.5)$ and the corresponding function value is
$f(x^*) = 0.34$. Because the optimal objective is nonzero, there must be discrepancies between
the observed measurements $y_j$ and the model predictions $\phi(t_j; x^*)$ for some (usually
most) values of $j$; the model has not reproduced all the data points exactly. How, then,
can we verify that $x^*$ is indeed a minimizer of $f$? To answer this question, we need to
define the term “solution” and explain how to recognize solutions. Only then can we discuss
algorithms for unconstrained optimization problems.


2.1  WHAT  IS  A  SOLUTION? 


Generally, we would be happiest if we found a global minimizer of $f$, a point where the
function attains its least value. A formal definition is

A point $x^*$ is a global minimizer if $f(x^*) \le f(x)$ for all $x$,

where $x$ ranges over all of $\mathbb{R}^n$ (or at least over the domain of interest to the modeler). The
global minimizer can be difficult to find, because our knowledge of $f$ is usually only local.
Since our algorithm does not visit many points (we hope!), we usually do not have a good
picture of the overall shape of $f$, and we can never be sure that the function does not take a
sharp dip in some region that has not been sampled by the algorithm. Most algorithms are
able to find only a local minimizer, which is a point that achieves the smallest value of $f$ in
its neighborhood. Formally, we say:

A point $x^*$ is a local minimizer if there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) \le f(x)$
for all $x \in \mathcal{N}$.

(Recall that a neighborhood of $x^*$ is simply an open set that contains $x^*$.) A point that satisfies
this definition is sometimes called a weak local minimizer. This terminology distinguishes
it from a strict local minimizer, which is the outright winner in its neighborhood.
Formally,

A point $x^*$ is a strict local minimizer (also called a strong local minimizer) if there is a
neighborhood $\mathcal{N}$ of $x^*$ such that $f(x^*) < f(x)$ for all $x \in \mathcal{N}$ with $x \ne x^*$.

For the constant function $f(x) = 2$, every point $x$ is a weak local minimizer, while the
function $f(x) = (x - 2)^4$ has a strict local minimizer at $x = 2$.

A  slightly  more  exotic  type  of  local  minimizer  is  defined  as  follows. 

A point $x^*$ is an isolated local minimizer if there is a neighborhood $\mathcal{N}$ of $x^*$ such that
$x^*$ is the only local minimizer in $\mathcal{N}$.

Some strict local minimizers are not isolated, as illustrated by the function

$$f(x) = x^4 \cos(1/x) + 2x^4, \qquad f(0) = 0,$$

which is twice continuously differentiable and has a strict local minimizer at $x^* = 0$.
However, there are strict local minimizers at many nearby points $x_j$, and we can label these
points so that $x_j \to 0$ as $j \to \infty$.
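The accumulation of local minimizers near zero can be observed numerically. The sketch below is a rough illustration rather than a rigorous argument: it scans a uniform grid and reports grid points whose function value lies below both neighbors. The intervals and grid sizes are arbitrary choices made here.

```python
import math

def f(x):
    """f(x) = x^4 cos(1/x) + 2 x^4 for x != 0, with f(0) = 0."""
    if x == 0.0:
        return 0.0
    return x ** 4 * math.cos(1.0 / x) + 2.0 * x ** 4

def grid_local_minimizers(a, b, n):
    """Return grid points in (a, b) whose value is below both grid neighbors."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    fs = [f(x) for x in xs]
    return [xs[i] for i in range(1, n) if fs[i] < fs[i - 1] and fs[i] < fs[i + 1]]

# Approximate local minimizers in two intervals approaching x* = 0.
near = grid_local_minimizers(0.01, 0.1, 20000)
nearer = grid_local_minimizers(0.001, 0.01, 200000)
print(len(near), len(nearer))
```

The interval closer to zero contains more local minimizers even though it is shorter, while every sampled value stays above $f(0) = 0$ because $\cos(1/x) \ge -1$ gives $f(x) \ge x^4$.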

While  strict  local  minimizers  are  not  always  isolated,  it  is  true  that  all  isolated  local 
minimizers  are  strict. 

Figure  2.2  illustrates  a  function  with  many  local  minimizers.  It  is  usually  difficult 
to  find  the  global  minimizer  for  such  functions,  because  algorithms  tend  to  be  “trapped” 
at  local  minimizers.  This  example  is  by  no  means  pathological.  In  optimization  problems 
associated  with  the  determination  of  molecular  conformation,  the  potential  function  to  be 
minimized  may  have  millions  of  local  minima. 


Figure 2.2 A difficult case for global minimization.

Sometimes we have additional “global” knowledge about $f$ that may help in identifying
global minima. An important special case is that of convex functions, for which every
local minimizer is also a global minimizer.


RECOGNIZING  A  LOCAL  MINIMUM 

From the definitions given above, it might seem that the only way to find out whether
a point $x^*$ is a local minimum is to examine all the points in its immediate vicinity, to
make sure that none of them has a smaller function value. When the function $f$ is smooth,
however, there are more efficient and practical ways to identify local minima. In particular, if
$f$ is twice continuously differentiable, we may be able to tell that $x^*$ is a local minimizer (and
possibly a strict local minimizer) by examining just the gradient $\nabla f(x^*)$ and the Hessian
$\nabla^2 f(x^*)$.

The  mathematical  tool  used  to  study  minimizers  of  smooth  functions  is  Taylor’s 
theorem.  Because  this  theorem  is  central  to  our  analysis  throughout  the  book,  we  state  it 
now.  Its  proof  can  be  found  in  any  calculus  textbook. 

Theorem 2.1 (Taylor’s Theorem).

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and that $p \in \mathbb{R}^n$. Then we have
that

$$f(x + p) = f(x) + \nabla f(x + tp)^T p, \tag{2.4}$$

for some $t \in (0, 1)$. Moreover, if $f$ is twice continuously differentiable, we have that

$$\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)\, p \, dt, \tag{2.5}$$

and that

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp)\, p, \tag{2.6}$$

for some $t \in (0, 1)$.

Necessary conditions for optimality are derived by assuming that $x^*$ is a local minimizer
and then proving facts about $\nabla f(x^*)$ and $\nabla^2 f(x^*)$.

Theorem  2.2  (First-Order  Necessary  Conditions). 

If $x^*$ is a local minimizer and $f$ is continuously differentiable in an open neighborhood
of $x^*$, then $\nabla f(x^*) = 0$.

PROOF. Suppose for contradiction that $\nabla f(x^*) \ne 0$. Define the vector $p = -\nabla f(x^*)$ and
note that $p^T \nabla f(x^*) = -\|\nabla f(x^*)\|^2 < 0$. Because $\nabla f$ is continuous near $x^*$, there is a
scalar $T > 0$ such that

$$p^T \nabla f(x^* + tp) < 0, \quad \text{for all } t \in [0, T].$$

For any $\bar{t} \in (0, T]$, we have by Taylor’s theorem that

$$f(x^* + \bar{t} p) = f(x^*) + \bar{t}\, p^T \nabla f(x^* + tp), \quad \text{for some } t \in (0, \bar{t}).$$

Therefore, $f(x^* + \bar{t} p) < f(x^*)$ for all $\bar{t} \in (0, T]$. We have found a direction leading
away from $x^*$ along which $f$ decreases, so $x^*$ is not a local minimizer, and we have a
contradiction. □


We call $x^*$ a stationary point if $\nabla f(x^*) = 0$. According to Theorem 2.2, any local
minimizer must be a stationary point.
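Stationarity can be checked numerically when no analytic gradient is available. The sketch below approximates the gradient by central differences and also mirrors the argument in the proof of Theorem 2.2: at a nonstationary point, a short step along the negative gradient lowers the function value. The test function, the step size `h`, and the step length `t` are assumptions chosen for illustration.

```python
import math

def fd_gradient(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(vi * vi for vi in v))

def f(x):
    """Illustrative smooth function with its minimizer at (1, -0.5)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

x_star = [1.0, -0.5]   # stationary point of this f
x_other = [0.0, 0.0]   # not a stationary point

g_star = fd_gradient(f, x_star)
g_other = fd_gradient(f, x_other)

# Step a short distance t along p = -grad f from the nonstationary point.
t = 1e-2
x_trial = [xi - t * gi for xi, gi in zip(x_other, g_other)]
print(norm(g_star), norm(g_other), f(x_other), f(x_trial))
```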

For the next result we recall that a matrix $B$ is positive definite if $p^T B p > 0$ for all
$p \ne 0$, and positive semidefinite if $p^T B p \ge 0$ for all $p$ (see the Appendix).


Theorem  2.3  (Second-Order  Necessary  Conditions). 

If $x^*$ is a local minimizer of $f$ and $\nabla^2 f$ exists and is continuous in an open neighborhood
of $x^*$, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite.

PROOF. We know from Theorem 2.2 that $\nabla f(x^*) = 0$. For contradiction, assume
that $\nabla^2 f(x^*)$ is not positive semidefinite. Then we can choose a vector $p$ such that
$p^T \nabla^2 f(x^*) p < 0$, and because $\nabla^2 f$ is continuous near $x^*$, there is a scalar $T > 0$
such that $p^T \nabla^2 f(x^* + tp) p < 0$ for all $t \in [0, T]$.

By doing a Taylor series expansion around $x^*$, we have for all $\bar{t} \in (0, T]$ and some
$t \in (0, \bar{t})$ that

$$f(x^* + \bar{t} p) = f(x^*) + \bar{t}\, p^T \nabla f(x^*) + \tfrac{1}{2} \bar{t}^2 p^T \nabla^2 f(x^* + tp)\, p < f(x^*).$$

As in Theorem 2.2, we have found a direction from $x^*$ along which $f$ is decreasing, and so
again, $x^*$ is not a local minimizer. □


We now describe sufficient conditions, which are conditions on the derivatives of $f$ at
the point $x^*$ that guarantee that $x^*$ is a local minimizer.

Theorem 2.4 (Second-Order Sufficient Conditions).

Suppose that $\nabla^2 f$ is continuous in an open neighborhood of $x^*$ and that $\nabla f(x^*) = 0$
and $\nabla^2 f(x^*)$ is positive definite. Then $x^*$ is a strict local minimizer of $f$.

PROOF. Because the Hessian is continuous and positive definite at $x^*$, we can choose a radius
$r > 0$ so that $\nabla^2 f(x)$ remains positive definite for all $x$ in the open ball
$\mathcal{D} = \{z \mid \|z - x^*\| < r\}$. Taking any nonzero vector $p$ with $\|p\| < r$, we have $x^* + p \in \mathcal{D}$ and so

$$f(x^* + p) = f(x^*) + p^T \nabla f(x^*) + \tfrac{1}{2} p^T \nabla^2 f(z)\, p = f(x^*) + \tfrac{1}{2} p^T \nabla^2 f(z)\, p,$$

where $z = x^* + tp$ for some $t \in (0, 1)$. Since $z \in \mathcal{D}$, we have $p^T \nabla^2 f(z) p > 0$, and therefore
$f(x^* + p) > f(x^*)$, giving the result. □

Note  that  the  second-order  sufficient  conditions  of  Theorem  2.4  guarantee  something 
stronger  than  the  necessary  conditions  discussed  earlier;  namely,  that  the  minimizer  is  a  strict 
local  minimizer.  Note  too  that  the  second-order  sufficient  conditions  are  not  necessary:  A 
point  x*  may  be  a  strict  local  minimizer,  and  yet  may  fail  to  satisfy  the  sufficient  conditions. 
A simple example is given by the function $f(x) = x^4$, for which the point $x^* = 0$ is a
strict  local  minimizer  at  which  the  Hessian  matrix  vanishes  (and  is  therefore  not  positive 
definite). 
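For a function of two variables, the tests in Theorems 2.2, 2.3, and 2.4 can be applied mechanically once the gradient and Hessian at a point are known. The sketch below assumes a symmetric $2 \times 2$ Hessian and checks definiteness through principal minors; the function name, the tolerance, and the returned labels are hypothetical choices made for this illustration.

```python
def classify_2x2(grad, hess, tol=1e-10):
    """Apply the first- and second-order tests at a point.

    grad is the gradient (two entries); hess is the symmetric 2x2 Hessian.
    """
    if max(abs(g) for g in grad) > tol:
        return "not stationary"              # Theorem 2.2 rules out a minimizer
    a, b, c = hess[0][0], hess[0][1], hess[1][1]
    det = a * c - b * b
    if a > tol and det > tol:
        return "strict local minimizer"      # positive definite: Theorem 2.4
    if a >= -tol and c >= -tol and det >= -tol:
        return "inconclusive"                # semidefinite only: neither test decides
    return "not a local minimizer"           # not semidefinite: Theorem 2.3 fails

# f(x) = x1^2 + x2^2 at the origin: Hessian 2I.
bowl = classify_2x2([0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
# f(x) = x1^2 - x2^2 at the origin: a saddle point.
saddle = classify_2x2([0.0, 0.0], [[2.0, 0.0], [0.0, -2.0]])
# f(x) = x1^4 + x2^4 at the origin: the Hessian vanishes.
quartic = classify_2x2([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
print(bowl, saddle, quartic)
```

The quartic case is the two-variable analogue of $f(x) = x^4$: the origin is a strict local minimizer, yet the derivative tests alone cannot confirm it.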

When  the  objective  function  is  convex,  local  and  global  minimizers  are  simple  to 
characterize. 

Theorem  2.5. 

When $f$ is convex, any local minimizer $x^*$ is a global minimizer of $f$. If in addition $f$ is
differentiable, then any stationary point $x^*$ is a global minimizer of $f$.

PROOF. Suppose that $x^*$ is a local but not a global minimizer. Then we can find a point
$z \in \mathbb{R}^n$ with $f(z) < f(x^*)$. Consider the line segment that joins $x^*$ to $z$, that is,

$$x = \lambda z + (1 - \lambda) x^*, \quad \text{for some } \lambda \in (0, 1]. \tag{2.7}$$

By the convexity property for $f$, we have

$$f(x) \le \lambda f(z) + (1 - \lambda) f(x^*) < f(x^*). \tag{2.8}$$

Any neighborhood $\mathcal{N}$ of $x^*$ contains a piece of the line segment (2.7), so there will always
be points $x \in \mathcal{N}$ at which (2.8) is satisfied. Hence, $x^*$ is not a local minimizer.

For the second part of the theorem, suppose that $x^*$ is not a global minimizer and
choose $z$ as above. Then, from convexity, we have

$$\begin{aligned}
\nabla f(x^*)^T (z - x^*) &= \frac{d}{d\lambda} f(x^* + \lambda (z - x^*)) \Big|_{\lambda = 0} \quad \text{(see the Appendix)} \\
&= \lim_{\lambda \downarrow 0} \frac{f(x^* + \lambda (z - x^*)) - f(x^*)}{\lambda} \\
&\le \lim_{\lambda \downarrow 0} \frac{\lambda f(z) + (1 - \lambda) f(x^*) - f(x^*)}{\lambda} \\
&= f(z) - f(x^*) < 0.
\end{aligned}$$

Therefore, $\nabla f(x^*) \ne 0$, and so $x^*$ is not a stationary point. □

These  results,  which  are  based  on  elementary  calculus,  provide  the  foundations  for 
unconstrained  optimization  algorithms.  In  one  way  or  another,  all  algorithms  seek  a  point 
where $\nabla f(\cdot)$ vanishes.

NONSMOOTH  PROBLEMS 

This  book  focuses  on  smooth  functions,  by  which  we  generally  mean  functions  whose 
second  derivatives  exist  and  are  continuous.  We  note,  however,  that  there  are  interesting 
problems  in  which  the  functions  involved  may  be  nonsmooth  and  even  discontinuous.  It  is 
not possible in general to identify a minimizer of a general discontinuous function. If, however,
the function consists of a few smooth pieces, with discontinuities between the pieces,
it  may  be  possible  to  find  the  minimizer  by  minimizing  each  smooth  piece  individually. 

If the function is continuous everywhere but nondifferentiable at certain points,
as in Figure 2.3, we can identify a solution by examining the subgradient or generalized
gradient, which are generalizations of the concept of gradient to the nonsmooth case.
Nonsmooth optimization is beyond the scope of this book; we refer instead to Hiriart-Urruty
and Lemaréchal [170] for an extensive discussion of theory. Here, we mention
only that the minimization of a function such as the one illustrated in Figure 2.3 (which
contains a jump discontinuity in the first derivative $f'(x)$ at the minimum) is difficult
because the behavior of $f$ is not predictable near the point of nonsmoothness. That
is, we cannot be sure that information about $f$ obtained at one point can be used
to infer anything about $f$ at neighboring points, because points of nondifferentiability
may intervene. However, minimization of certain special nondifferentiable functions,
such as

$$f(x) = \|r(x)\|_1, \qquad f(x) = \|r(x)\|_\infty \tag{2.9}$$

(where $r(x)$ is a vector function), can be reformulated as smooth constrained optimization
problems; see Exercise 12.5 in Chapter 12 and (17.31). The functions (2.9) are
useful in data fitting, where $r(x)$ is the residual vector whose components are defined
in (2.2).

Figure 2.3 Nonsmooth function with minimum at a kink.


2.2  OVERVIEW  OF  ALGORITHMS 


The last forty years have seen the development of a powerful collection of algorithms for
unconstrained optimization of smooth functions. We now give a broad description of their
main properties, and we describe them in more detail in Chapters 3, 4, 5, 6, and 7. All
algorithms for unconstrained minimization require the user to supply a starting point,
which we usually denote by $x_0$. The user with knowledge about the application and the
data set may be in a good position to choose $x_0$ to be a reasonable estimate of the solution.
Otherwise, the starting point must be chosen by the algorithm, either by a systematic
approach or in some arbitrary manner.

Beginning at $x_0$, optimization algorithms generate a sequence of iterates $\{x_k\}_{k=0}^{\infty}$
that terminate when either no more progress can be made or when it seems that a solution
point has been approximated with sufficient accuracy. In deciding how to move
from one iterate $x_k$ to the next, the algorithms use information about the function $f$ at
$x_k$, and possibly also information from earlier iterates $x_0, x_1, \ldots, x_{k-1}$. They use this
information to find a new iterate $x_{k+1}$ with a lower function value than $x_k$. (There exist
nonmonotone algorithms that do not insist on a decrease in $f$ at every step, but even these
algorithms require $f$ to be decreased after some prescribed number $m$ of iterations, that is,
$f(x_k) < f(x_{k-m})$.)

There are two fundamental strategies for moving from the current point $x_k$ to a new
iterate $x_{k+1}$. Most of the algorithms described in this book follow one of these approaches.

TWO STRATEGIES: LINE SEARCH AND TRUST REGION

In the line search strategy, the algorithm chooses a direction $p_k$ and searches along
this direction from the current iterate $x_k$ for a new iterate with a lower function value.
The distance to move along $p_k$ can be found by approximately solving the following
one-dimensional minimization problem to find a step length $\alpha$:

$$\min_{\alpha > 0} f(x_k + \alpha p_k). \tag{2.10}$$


By solving (2.10) exactly, we would derive the maximum benefit from the direction $p_k$, but
an exact minimization may be expensive and is usually unnecessary. Instead, the line search
algorithm generates a limited number of trial step lengths until it finds one that loosely
approximates the minimum of (2.10). At the new point, a new search direction and step
length are computed, and the process is repeated.
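One common way to generate a limited number of trial step lengths is backtracking: start from an initial trial value and shrink it until the function value has decreased enough. The sketch below is only one possible realization. The sufficient-decrease test, the constants `rho` and `c`, and the example function are assumptions introduced here, not something specified in this section.

```python
def backtracking(f, x, p, g, alpha0=1.0, rho=0.5, c=1e-4, max_trials=50):
    """Shrink the trial step length until f decreases sufficiently along p.

    x is the current point, p the search direction, g the gradient at x.
    Accepts alpha when f(x + alpha p) <= f(x) + c * alpha * p^T g.
    """
    fx = f(x)
    slope = sum(gi * pi for gi, pi in zip(g, p))  # p^T g, negative for descent
    alpha = alpha0
    for _ in range(max_trials):
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        if f(x_new) <= fx + c * alpha * slope:
            return alpha
        alpha *= rho  # trial step rejected: try a shorter one
    raise RuntimeError("no acceptable step length found")

def f(x):
    """Illustrative objective: an elongated quadratic bowl."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

x = [1.0, 1.0]
g = [2.0 * x[0], 20.0 * x[1]]   # gradient of f at x
p = [-gi for gi in g]           # search along the negative gradient

alpha = backtracking(f, x, p, g)
x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
print(alpha, f(x), f(x_new))
```

Here the trial values 1, 0.5, 0.25, and 0.125 are rejected and 0.0625 is accepted, so five function evaluations along the direction suffice.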

In the second algorithmic strategy, known as trust region, the information gathered
about $f$ is used to construct a model function $m_k$ whose behavior near the current point
$x_k$ is similar to that of the actual objective function $f$. Because the model $m_k$ may not be a
good approximation of $f$ when $x$ is far from $x_k$, we restrict the search for a minimizer of $m_k$
to some region around $x_k$. In other words, we find the candidate step $p$ by approximately
solving the following subproblem:

$$\min_{p} m_k(x_k + p), \quad \text{where } x_k + p \text{ lies inside the trust region.} \tag{2.11}$$

If the candidate solution does not produce a sufficient decrease in $f$, we conclude that the
trust region is too large, and we shrink it and re-solve (2.11). Usually, the trust region is a
ball defined by $\|p\|_2 \le \Delta$, where the scalar $\Delta > 0$ is called the trust-region radius. Elliptical
and box-shaped trust regions may also be used.

The model $m_k$ in (2.11) is usually defined to be a quadratic function of the form

$$m_k(x_k + p) = f_k + p^T \nabla f_k + \tfrac{1}{2} p^T B_k p, \tag{2.12}$$

where $f_k$, $\nabla f_k$, and $B_k$ are a scalar, vector, and matrix, respectively. As the notation indicates,
$f_k$ and $\nabla f_k$ are chosen to be the function and gradient values at the point $x_k$, so that $m_k$
and $f$ are in agreement to first order at the current iterate $x_k$. The matrix $B_k$ is either the
Hessian $\nabla^2 f_k$ or some approximation to it.

Suppose that the objective function is given by $f(x) = 10(x_2 - x_1^2)^2 + (1 - x_1)^2$. At
the point $x_k = (0, 1)$ its gradient and Hessian are

$$\nabla f_k = \begin{bmatrix} -2 \\ 20 \end{bmatrix}, \qquad \nabla^2 f_k = \begin{bmatrix} -38 & 0 \\ 0 & 20 \end{bmatrix}.$$

Figure 2.4 Two possible trust regions (circles) and their corresponding steps $p_k$. The
solid lines are contours of the model function $m_k$.


The contour lines of the quadratic model (2.12) with $B_k = \nabla^2 f_k$ are depicted in Figure 2.4,
which also illustrates the contours of the objective function $f$ and the trust region. We
have indicated contour lines where the model has values 1 and 12. Note from Figure 2.4
that each time we decrease the size of the trust region after failure of a candidate iterate,
the step from $x_k$ to the new candidate will be shorter, and it usually points in a different
direction from the previous candidate. The trust-region strategy differs in this respect from
line search, which stays with a single search direction.
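The quadratic model for the example function $f(x) = 10(x_2 - x_1^2)^2 + (1 - x_1)^2$ at $x_k = (0, 1)$ can be evaluated directly. The sketch below uses derivatives worked out by hand for this function and compares the model with the true function for a short step and a long step; the two steps are arbitrary choices made for illustration.

```python
def f(x):
    """Example objective f(x) = 10 (x2 - x1^2)^2 + (1 - x1)^2."""
    x1, x2 = x
    return 10.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2

def grad(x):
    """Gradient of f, obtained by differentiating the formula above."""
    x1, x2 = x
    return [-40.0 * x1 * (x2 - x1 ** 2) - 2.0 * (1.0 - x1),
            20.0 * (x2 - x1 ** 2)]

def hess(x):
    """Hessian of f."""
    x1, x2 = x
    return [[-40.0 * (x2 - x1 ** 2) + 80.0 * x1 ** 2 + 2.0, -40.0 * x1],
            [-40.0 * x1, 20.0]]

def model(xk, p):
    """Quadratic model f_k + p^T grad f_k + 0.5 p^T B_k p with B_k the Hessian."""
    g, B = grad(xk), hess(xk)
    lin = g[0] * p[0] + g[1] * p[1]
    quad = sum(p[i] * B[i][j] * p[j] for i in range(2) for j in range(2))
    return f(xk) + lin + 0.5 * quad

xk = [0.0, 1.0]
p_small = [0.1, -0.1]
p_large = [1.0, -1.0]

# Discrepancy between the model and the true function for each step.
err_small = abs(f([xk[0] + p_small[0], xk[1] + p_small[1]]) - model(xk, p_small))
err_large = abs(f([xk[0] + p_large[0], xk[1] + p_large[1]]) - model(xk, p_large))
print(grad(xk), hess(xk), err_small, err_large)
```

The model error grows sharply with the step length, which is the motivation for restricting the step to a trust region. In this instance the Hessian at $x_k$ has a negative eigenvalue, so the model with $B_k = \nabla^2 f_k$ is unbounded below and has no minimizer unless the step is bounded.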

In a sense, the line search and trust-region approaches differ in the order in which they
choose the direction and distance of the move to the next iterate. Line search starts by fixing
the direction $p_k$ and then identifying an appropriate distance, namely the step length $\alpha_k$. In
trust region, we first choose a maximum distance (the trust-region radius $\Delta_k$) and then
seek a direction and step that attain the best improvement possible subject to this distance
constraint. If this step proves to be unsatisfactory, we reduce the distance measure $\Delta_k$ and
try again.

The line search approach is discussed in more detail in Chapter 3. Chapter 4 discusses
the trust-region strategy, including techniques for choosing and adjusting the size of the region
and for computing approximate solutions to the trust-region problems (2.11). We now
preview two major issues: choice of the search direction $p_k$ in line search methods, and choice
of the Hessian $B_k$ in trust-region methods. These issues are closely related, as we now observe.

SEARCH  DIRECTIONS  FOR  LINE  SEARCH  METHODS 

The steepest descent direction $-\nabla f_k$ is the most obvious choice for search direction
for a line search method. It is intuitive; among all the directions we could move from $x_k$,
it is the one along which $f$ decreases most rapidly. To verify this claim, we appeal again
to Taylor’s theorem (Theorem 2.1), which tells us that for any search direction $p$ and
step-length parameter $\alpha$, we have

$$f(x_k + \alpha p) = f(x_k) + \alpha p^T \nabla f_k + \tfrac{1}{2} \alpha^2 p^T \nabla^2 f(x_k + tp)\, p, \quad \text{for some } t \in (0, \alpha)$$

(see (2.6)). The rate of change in $f$ along the direction $p$ at $x_k$ is simply the coefficient of
$\alpha$, namely, $p^T \nabla f_k$. Hence, the unit direction $p$ of most rapid decrease is the solution to the
problem

$$\min_{p} p^T \nabla f_k, \quad \text{subject to } \|p\| = 1. \tag{2.13}$$

Since $p^T \nabla f_k = \|p\| \, \|\nabla f_k\| \cos\theta = \|\nabla f_k\| \cos\theta$, where $\theta$ is the angle between $p$ and $\nabla f_k$,
it is easy to see that the minimizer is attained when $\cos\theta = -1$ and

$$p = -\nabla f_k / \|\nabla f_k\|,$$

as claimed. As we illustrate in Figure 2.5, this direction is orthogonal to the contours of the
function.

The steepest descent method is a line search method that moves along $p_k = -\nabla f_k$ at
every step. It can choose the step length $\alpha_k$ in a variety of ways, as we discuss in Chapter 3. One
advantage of the steepest descent direction is that it requires calculation of the gradient $\nabla f_k$
but not of second derivatives. However, it can be excruciatingly slow on difficult problems.
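The slow behavior of steepest descent shows up even on a two-variable quadratic. The sketch below applies the iteration $x_{k+1} = x_k - \alpha \nabla f_k$ to $f(x) = \tfrac{1}{2}(x_1^2 + \kappa x_2^2)$ and counts iterations. The constant step length $\alpha = 1/\kappa$, the stopping tolerance, and the test problem are simplifying assumptions made here for illustration; they are not a recommended step length rule.

```python
import math

def steepest_descent(grad, x0, alpha, tol=1e-6, max_iter=100000):
    """Iterate x <- x - alpha * grad(x) until the gradient norm is at most tol.

    Returns the final point and the number of steps taken.
    """
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

def make_grad(kappa):
    """Gradient of f(x) = 0.5 * (x1^2 + kappa * x2^2)."""
    return lambda x: [x[0], kappa * x[1]]

x0 = [1.0, 1.0]

# Well-scaled problem (kappa = 1): the contours are circles.
_, iters_well = steepest_descent(make_grad(1.0), x0, alpha=1.0)

# Poorly scaled problem (kappa = 100): the contours are elongated ellipses.
_, iters_ill = steepest_descent(make_grad(100.0), x0, alpha=1.0 / 100.0)

print(iters_well, iters_ill)
```

With circular contours the negative gradient points straight at the minimizer and one step suffices, whereas the elongated problem needs more than a thousand steps to reach the same tolerance.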

Figure 2.5 Steepest descent direction for a function of two variables.

Figure 2.6 A downhill direction $p_k$.

Line search methods may use search directions other than the steepest descent direction.
In general, any descent direction (one that makes an angle of strictly less than $\pi/2$
radians with $-\nabla f_k$) is guaranteed to produce a decrease in $f$, provided that the step length
is sufficiently small (see Figure 2.6). We can verify this claim by using Taylor’s theorem.
From (2.6), we have that

$$f(x_k + \epsilon p_k) = f(x_k) + \epsilon p_k^T \nabla f_k + O(\epsilon^2).$$

When $p_k$ is a downhill direction, the angle $\theta_k$ between $p_k$ and $\nabla f_k$ has $\cos\theta_k < 0$, so that

$$p_k^T \nabla f_k = \|p_k\| \, \|\nabla f_k\| \cos\theta_k < 0.$$

It follows that $f(x_k + \epsilon p_k) < f(x_k)$ for all positive but sufficiently small values of $\epsilon$.
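The descent test is a single inner product, and the role of the phrase "sufficiently small" is easy to see numerically. The sketch below checks the sign of $p^T \nabla f$ for two directions and then compares function values for a short step and a long step. The function, the point, and the directions are assumptions chosen for illustration.

```python
def f(x):
    """Illustrative objective."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

def is_descent(p, g):
    """A direction p is a descent direction when p^T g < 0."""
    return sum(pi * gi for pi, gi in zip(p, g)) < 0.0

def step(x, p, eps):
    """Return the point x + eps * p."""
    return [xi + eps * pi for xi, pi in zip(x, p)]

x = [1.0, 1.0]
g = [2.0 * x[0], 20.0 * x[1]]   # gradient of f at x

p_down = [-1.0, 0.0]            # p^T g = -2: a descent direction
p_up = [1.0, -0.05]             # p^T g = +1: not a descent direction

f_small_down = f(step(x, p_down, 1e-3))  # short step along a descent direction
f_small_up = f(step(x, p_up, 1e-3))      # short step along a non-descent direction
f_large_down = f(step(x, p_down, 3.0))   # long step along the same descent direction

print(is_descent(p_down, g), is_descent(p_up, g))
print(f(x), f_small_down, f_small_up, f_large_down)
```

The long step along `p_down` raises the function value even though the direction is downhill, which is why a step length rule is needed in addition to a descent direction.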

Another important search direction (perhaps the most important one of all)
is the Newton direction. This direction is derived from the second-order Taylor series
approximation to $f(x_k + p)$, which is

$$f(x_k + p) \approx f_k + p^T \nabla f_k + \tfrac{1}{2} p^T \nabla^2 f_k\, p \stackrel{\text{def}}{=} m_k(p). \tag{2.14}$$

Assuming for the moment that $\nabla^2 f_k$ is positive definite, we obtain the Newton direction
by finding the vector $p$ that minimizes $m_k(p)$. By simply setting the derivative of $m_k(p)$ to
zero, we obtain the following explicit formula:

$$p_k^N = -\left(\nabla^2 f_k\right)^{-1} \nabla f_k. \tag{2.15}$$

The Newton direction is reliable when the difference between the true function
$f(x_k + p)$ and its quadratic model $m_k(p)$ is not too large. By comparing (2.14) with (2.6),
we see that the only difference between these functions is that the matrix $\nabla^2 f(x_k + tp)$ in
the third term of the expansion has been replaced by $\nabla^2 f_k$. If $\nabla^2 f$ is sufficiently smooth,
this difference introduces a perturbation of only $O(\|p\|^3)$ into the expansion, so that when
$\|p\|$ is small, the approximation $f(x_k + p) \approx m_k(p)$ is quite accurate.
The Newton direction can be used in a line search method when $\nabla^2 f_k$ is positive
definite, for in this case we have

$$\nabla f_k^T p_k^N = -(p_k^N)^T \nabla^2 f_k \, p_k^N \le -\sigma_k \|p_k^N\|^2$$

for some $\sigma_k > 0$. Unless the gradient $\nabla f_k$ (and therefore the step $p_k^N$) is zero, we have that
$\nabla f_k^T p_k^N < 0$, so the Newton direction is a descent direction.
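As a minimal sketch of formula (2.15), the snippet below computes a Newton direction by solving the linear system $\nabla^2 f_k \, p = -\nabla f_k$ rather than forming the inverse explicitly. The Hessian and gradient values are hypothetical, chosen only so that the Hessian is positive definite.

```python
import numpy as np

def newton_direction(grad, hess):
    """Return p = -hess^{-1} grad by solving hess @ p = -grad (no explicit inverse)."""
    return np.linalg.solve(hess, -grad)

# Hypothetical data (assumptions): a positive definite Hessian and a gradient.
hess = np.array([[4.0, 1.0],
                 [1.0, 3.0]])
grad = np.array([1.0, -2.0])

p_newton = newton_direction(grad, hess)

# The directional derivative equals -p^T hess p, which is negative
# whenever hess is positive definite and grad is nonzero.
slope = grad @ p_newton
print(p_newton, slope)
```

Solving the system is generally preferred to inverting the matrix, both for cost and for numerical accuracy.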

Unlike the steepest descent direction, the Newton direction has a “natural” step length of 1
associated with it. Most line search implementations of Newton’s method use the
unit step $\alpha = 1$ where possible and adjust $\alpha$ only when it does not produce a satisfactory
reduction in the value of $f$.

When $\nabla^2 f_k$ is not positive definite, the Newton direction may not even be defined,
since $(\nabla^2 f_k)^{-1}$ may not exist. Even when it is defined, it may not satisfy the descent property
$\nabla f_k^T p_k^N < 0$, in which case it is unsuitable as a search direction. In these situations, line
search methods modify the definition of $p_k$ to make it satisfy the descent condition while
retaining the benefit of the second-order information contained in $\nabla^2 f_k$. We describe these
modifications in Chapter 3.

Methods  that  use  the  Newton  direction  have  a  fast  rate  of  local  convergence,  typically 
quadratic.  After  a  neighborhood  of  the  solution  is  reached,  convergence  to  high  accuracy 
often  occurs  in  just  a  few  iterations.  The  main  drawback  of  the  Newton  direction  is  the 
need for the Hessian $\nabla^2 f(x)$. Explicit computation of this matrix of second derivatives
can  sometimes  be  a  cumbersome,  error-prone,  and  expensive  process.  Finite-difference  and 
automatic  differentiation  techniques  described  in  Chapter  8  may  be  useful  in  avoiding  the 
need  to  calculate  second  derivatives  by  hand. 

Quasi-Newton search directions provide an attractive alternative to Newton’s method
in that they do not require computation of the Hessian and yet still attain a superlinear rate
of convergence. In place of the true Hessian $\nabla^2 f_k$, they use an approximation $B_k$, which is
updated after each step to take account of the additional knowledge gained during the step.
The updates make use of the fact that changes in the gradient $g$ provide information about
the second derivative of $f$ along the search direction. By using the expression (2.5) from our
statement of Taylor’s theorem, we have by adding and subtracting the term $\nabla^2 f(x) p$ that

$$\nabla f(x + p) = \nabla f(x) + \nabla^2 f(x) p + \int_0^1 \left[ \nabla^2 f(x + tp) - \nabla^2 f(x) \right] p \, dt.$$

Because $\nabla^2 f(\cdot)$ is continuous, the size of the final integral term is $o(\|p\|)$. By setting $x = x_k$
and $p = x_{k+1} - x_k$, we obtain

$$\nabla f_{k+1} = \nabla f_k + \nabla^2 f_k (x_{k+1} - x_k) + o(\|x_{k+1} - x_k\|).$$


When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$, within which $\nabla^2 f$ is positive definite,
the final term in this expansion is eventually dominated by the $\nabla^2 f_k (x_{k+1} - x_k)$ term, and
we can write

$$\nabla^2 f_k (x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k. \tag{2.16}$$

We choose the new Hessian approximation $B_{k+1}$ so that it mimics the property (2.16) of
the true Hessian, that is, we require it to satisfy the following condition, known as the secant
equation:

$$B_{k+1} s_k = y_k, \tag{2.17}$$

where

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k.$$

Typically, we impose additional conditions on $B_{k+1}$, such as symmetry (motivated by
symmetry of the exact Hessian), and a requirement that the difference between successive
approximations $B_k$ and $B_{k+1}$ have low rank.

Two of the most popular formulae for updating the Hessian approximation $B_k$ are
the symmetric-rank-one (SR1) formula, defined by

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}, \tag{2.18}$$

and the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno,
which is defined by

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{2.19}$$

Note that the difference between the matrices $B_k$ and $B_{k+1}$ is a rank-one matrix in the
case of (2.18) and a rank-two matrix in the case of (2.19). Both updates satisfy the secant
equation and both maintain symmetry. One can show that the BFGS update (2.19) generates
positive definite approximations whenever the initial approximation $B_0$ is positive definite
and $s_k^T y_k > 0$. We discuss these issues further in Chapter 6.
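The two updates can be written in a few lines. The sketch below is illustrative only: the matrices and vectors are hypothetical data (with $y_k$ generated from a quadratic whose Hessian is `A`), and neither function guards against a vanishing denominator, which a practical implementation would need to do. It checks numerically that both updates satisfy the secant equation (2.17) and preserve symmetry.

```python
import numpy as np

def sr1_update(B, s, y):
    """Symmetric-rank-one update (2.18). No safeguard for a tiny denominator."""
    r = y - B @ s
    return B + np.outer(r, r) / (r @ s)

def bfgs_update(B, s, y):
    """BFGS update (2.19). Assumes B is symmetric and y^T s > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Hypothetical data (assumptions): y = A s for a quadratic with Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
B0 = np.eye(2)                 # positive definite initial approximation
s = np.array([1.0, 0.5])
y = A @ s                      # here y^T s = 4.5 > 0

B_sr1 = sr1_update(B0, s, y)
B_bfgs = bfgs_update(B0, s, y)

print(B_sr1 @ s - y)           # residual of the secant equation for SR1
print(B_bfgs @ s - y)          # residual of the secant equation for BFGS
print(np.linalg.eigvalsh(B_bfgs))
```

Because `B0` is positive definite and $s_k^T y_k > 0$ in this example, the eigenvalues printed for the BFGS matrix are positive, consistent with the statement above.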

The quasi-Newton search direction is obtained by using $B_k$ in place of the exact
Hessian in the formula (2.15), that is,

$$p_k = -B_k^{-1} \nabla f_k. \tag{2.20}$$

Some practical implementations of quasi-Newton methods avoid the need to factorize $B_k$
at each iteration by updating the inverse of $B_k$, instead of $B_k$ itself. In fact, the equivalent
formula for the BFGS update (2.19), applied to the inverse approximation $H_k = B_k^{-1}$, is

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}. \tag{2.21}$$

Calculation of $p_k$ can then be performed by using the formula $p_k = -H_k \nabla f_k$. This
matrix-vector multiplication is simpler than the factorization/back-substitution procedure that is
needed to implement the formula (2.20).
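A minimal sketch of the inverse update (2.21) follows. The matrix `B`, the vectors `s` and `y`, and the gradient `grad` are hypothetical values chosen for illustration. The check at the end confirms numerically that updating the inverse by (2.21) gives the inverse of the matrix produced by the BFGS formula (2.19).

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Update (2.21) of the inverse approximation H_k. Assumes y^T s > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

def bfgs_update(B, s, y):
    """BFGS update (2.19) of B_k, used here only for comparison."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Hypothetical data (assumptions, not from the text).
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])
H = np.linalg.inv(B)
s = np.array([1.0, -1.0])
y = np.array([2.0, -0.5])      # y^T s = 2.5 > 0
grad = np.array([0.3, -0.7])

H_next = bfgs_inverse_update(H, s, y)
B_next = bfgs_update(B, s, y)

# The search direction needs only a matrix-vector product.
p = -H_next @ grad

print(H_next @ B_next)         # close to the identity matrix
```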

Two  variants  of  quasi-Newton  methods  designed  to  solve  large  problems — partially 
separable  and  limited-memory  updating — are  described  in  Chapter  7. 

The last class of search directions we preview here is that generated by nonlinear
conjugate gradient methods. They have the form

$$p_k = -\nabla f(x_k) + \beta_k p_{k-1},$$

where $\beta_k$ is a scalar that ensures that $p_k$ and $p_{k-1}$ are conjugate, an important concept
in the minimization of quadratic functions that will be defined in Chapter 5. Conjugate
gradient methods were originally designed to solve systems of linear equations $Ax = b$,
where the coefficient matrix $A$ is symmetric and positive definite. The problem of solving
this linear system is equivalent to the problem of minimizing the convex quadratic function
defined by

$$\phi(x) = \tfrac{1}{2} x^T A x - b^T x,$$

so  it  was  natural  to  investigate  extensions  of  these  algorithms  to  more  general  types  of 
unconstrained  minimization  problems.  In  general,  nonlinear  conjugate  gradient  directions 
are  much  more  effective  than  the  steepest  descent  direction  and  are  almost  as  simple  to 
compute.  These  methods  do  not  attain  the  fast  convergence  rates  of  Newton  or  quasi- 
Newton  methods,  but  they  have  the  advantage  of  not  requiring  storage  of  matrices.  An 
extensive  discussion  of  nonlinear  conjugate  gradient  methods  is  given  in  Chapter  5. 
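The equivalence between solving $Ax = b$ and minimizing the convex quadratic mentioned above rests on the fact that the gradient of the quadratic is $Ax - b$. The sketch below illustrates this with a small, hypothetical symmetric positive definite matrix. It is not a conjugate gradient implementation; it only checks that the solution of the linear system is a stationary point, and a minimizer, of the quadratic.

```python
import numpy as np

# Hypothetical data (assumptions): A is symmetric positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def phi(x):
    """Convex quadratic phi(x) = 0.5 * x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

def grad_phi(x):
    """Gradient of phi, which is the residual A x - b of the linear system."""
    return A @ x - b

x_star = np.linalg.solve(A, b)

# phi is larger at randomly perturbed points than at x_star.
rng = np.random.default_rng(0)
perturbed = [phi(x_star + rng.normal(size=2)) for _ in range(5)]

print(grad_phi(x_star), phi(x_star), min(perturbed))
```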

All  of  the  search  directions  discussed  so  far  can  be  used  directly  in  a  line  search 
framework.  They  give  rise  to  the  steepest  descent,  Newton,  quasi-Newton,  and  conjugate 
gradient  line  search  methods.  All  except  conjugate  gradients  have  an  analogue  in  the  trust- 
region  framework,  as  we  now  discuss. 

MODELS  FOR  TRUST-REGION  METHODS 

If we set $B_k = 0$ in (2.12) and define the trust region using the Euclidean norm, the
trust-region subproblem (2.11) becomes

$$\min_p \; f_k + p^T \nabla f_k \quad \text{subject to} \quad \|p\|_2 \le \Delta_k.$$

We can write the solution to this problem in closed form as

$$p_k = -\frac{\Delta_k \nabla f_k}{\|\nabla f_k\|}.$$

This  is  simply  a  steepest  descent  step  in  which  the  step  length  is  determined  by  the  trust- 
region  radius;  the  trust-region  and  line  search  approaches  are  essentially  the  same  in  this  case. 
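The closed-form step for the linear model can be sketched as follows. The gradient and radius are hypothetical values. The comparison against random feasible points is only a sanity check that no sampled point inside the trust region gives a smaller value of the linear term $p^T \nabla f_k$.

```python
import numpy as np

def linear_model_tr_step(grad, delta):
    """Minimizer of f_k + p^T grad subject to ||p||_2 <= delta (requires grad != 0)."""
    return -delta * grad / np.linalg.norm(grad)

# Hypothetical data (assumptions).
grad = np.array([3.0, -4.0])   # ||grad|| = 5
delta = 0.5

p = linear_model_tr_step(grad, delta)
best = p @ grad                # equals -delta * ||grad||

# Sample random points in the trust region and compare the linear term.
rng = np.random.default_rng(1)
samples = rng.normal(size=(200, 2))
samples = delta * samples / np.maximum(np.linalg.norm(samples, axis=1, keepdims=True), 1.0)
sample_values = samples @ grad

print(p, best, sample_values.min())
```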

A more interesting trust-region algorithm is obtained by choosing $B_k$ to be the
exact Hessian $\nabla^2 f_k$ in the quadratic model (2.12). Because of the trust-region restriction
$\|p\|_2 \le \Delta_k$, the subproblem (2.11) is guaranteed to have a solution $p_k$ even when $\nabla^2 f_k$ is not
positive definite, as we see in Figure 2.4. The trust-region Newton method has proved to
be highly effective in practice, as we discuss in Chapter 7.

If the matrix $B_k$ in the quadratic model function $m_k$ of (2.12) is defined by means of
a quasi-Newton approximation, we obtain a trust-region quasi-Newton method.


SCALING 

The performance of an algorithm may depend crucially on how the problem is formulated.
One important issue in problem formulation is scaling. In unconstrained optimization,
a problem is said to be poorly scaled if changes to $x$ in a certain direction produce much larger
variations in the value of $f$ than do changes to $x$ in another direction. A simple example is
provided by the function $f(x) = 10^9 x_1^2 + x_2^2$, which is very sensitive to small changes in $x_1$
but not so sensitive to perturbations in $x_2$.

Poorly scaled functions arise, for example, in simulations of physical and chemical
systems where different processes are taking place at very different rates. To be more specific,
consider a chemical system in which four reactions occur. Associated with each reaction is
a rate constant that describes the speed at which the reaction takes place. The optimization
problem is to find values for these rate constants by observing the concentrations of each
chemical in the system at different times. The four constants differ greatly in magnitude, since
the reactions take place at vastly different speeds. Suppose we have the following rough
estimates for the final values of the constants, each correct to within, say, an order of magnitude:

$$x_1 \approx 10^{-10}, \qquad x_2 \approx x_3 \approx 1, \qquad x_4 \approx 10^5.$$

Before solving this problem we could introduce a new variable $z$ defined by

$$x = D z, \qquad D = \operatorname{diag}(10^{-10}, \, 1, \, 1, \, 10^5),$$

and then define and solve the optimization problem in terms of the new variable $z$. The
optimal values of $z$ will be within about an order of magnitude of 1, making the solution
more balanced. This kind of scaling of the variables is known as diagonal scaling.

Figure 2.7 Poorly scaled and well scaled problems, and performance of the steepest
descent direction.
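A change of variables of this kind can be sketched in a few lines. In the snippet below, the vector `d` of magnitude estimates, the "true" rate constants `x_true`, and the helper names are all assumptions made for illustration. The idea is simply to optimize over $z$ and map back to $x$ inside the objective.

```python
import numpy as np

# Assumed magnitude estimates for the four rate constants (illustrative values).
d = np.array([1e-10, 1.0, 1.0, 1e5])

def to_scaled(x):
    """Return z such that x = D z, where D = diag(d)."""
    return x / d

def from_scaled(z):
    """Recover the original variables x = D z."""
    return d * z

def make_scaled_objective(f):
    """Wrap an objective f(x) so that it can be minimized over z instead of x."""
    return lambda z: f(from_scaled(z))

# Hypothetical rate constants, each within an order of magnitude of its estimate.
x_true = np.array([3e-10, 0.4, 2.0, 7e5])
z_true = to_scaled(x_true)

# A hypothetical objective and its scaled counterpart.
objective = lambda x: float(np.sum(((x - x_true) / d) ** 2))
scaled_objective = make_scaled_objective(objective)

print(z_true, scaled_objective(z_true))
```

All components of `z_true` lie between 0.1 and 10, whereas the components of `x_true` span fifteen orders of magnitude.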

Scaling  is  performed  (sometimes  unintentionally)  when  the  units  used  to  represent 
variables  are  changed.  During  the  modeling  process,  we  may  decide  to  change  the  units  of 
some  variables,  say  from  meters  to  millimeters.  If  we  do,  the  range  of  those  variables  and 
their  size  relative  to  the  other  variables  will  both  change. 

Some  optimization  algorithms,  such  as  steepest  descent,  are  sensitive  to  poor  scaling, 
while  others,  such  as  Newton’s  method,  are  unaffected  by  it.  Figure  2.7  shows  the  contours 
of  two  convex  nearly  quadratic  functions,  the  first  of  which  is  poorly  scaled,  while  the  second 
is  well  scaled.  For  the  poorly  scaled  problem,  the  one  with  highly  elongated  contours,  the 
steepest  descent  direction  does  not  yield  much  reduction  in  the  function,  while  for  the 
well-scaled  problem  it  performs  much  better.  In  both  cases,  Newton’s  method  will  produce 
a much better step, since the second-order quadratic model ($m_k$ in (2.14)) happens to be a
good approximation of $f$.
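The contrast between the two directions can be seen on the poorly scaled function $10^9 x_1^2 + x_2^2$ mentioned earlier in this section. The sketch below assumes a particular starting point and, for steepest descent, uses the step length that exactly minimizes a quadratic along the search direction; both choices are illustrative assumptions.

```python
import numpy as np

# Poorly scaled quadratic: f(x) = 1e9 * x1^2 + x2^2, with constant Hessian H.
H = np.diag([2.0e9, 2.0])

def f(x):
    return 1.0e9 * x[0] ** 2 + x[1] ** 2

def grad_f(x):
    return H @ x

x = np.array([1.0e-6, 1.0])    # assumed starting point, f(x) is about 1.001
g = grad_f(x)

# Steepest descent step with the exact minimizing step length for a quadratic.
alpha_sd = (g @ g) / (g @ H @ g)
x_sd = x - alpha_sd * g

# Newton step with unit step length.
x_newton = x + np.linalg.solve(H, -g)

print(f(x), f(x_sd), f(x_newton))
```

Even with the best possible step length, the steepest descent step barely reduces $f$ here, because the gradient points almost entirely along $x_1$ while most of the remaining error lies along $x_2$. The Newton step reaches the minimizer of this quadratic in one iteration.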

Algorithms  that  are  not  sensitive  to  scaling  are  preferable,  because  they  can  handle 
poor  problem  formulations  in  a  more  robust  fashion.  In  designing  complete  algorithms,  we 
try  to  incorporate  scale  invariance  into  all  aspects  of  the  algorithm,  including  the  line  search 
or  trust-region  strategies  and  convergence  tests.  Generally  speaking,  it  is  easier  to  preserve 
scale  invariance  for  line  search  algorithms  than  for  trust-region  algorithms. 


Exercises


2.1 Compute the gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ of the Rosenbrock function

$$f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2. \tag{2.22}$$

Show that $x^* = (1, 1)^T$ is the only local minimizer of this function, and that the Hessian
matrix at that point is positive definite.

2.2 Show that the function $f(x) = 8x_1 + 12x_2 + x_1^2 - 2x_2^2$ has only one stationary
point, and that it is neither a maximum nor a minimum, but a saddle point. Sketch the contour
lines of $f$.

2.3 Let $a$ be a given $n$-vector, and $A$ be a given $n \times n$ symmetric matrix. Compute the
gradient and Hessian of $f_1(x) = a^T x$ and $f_2(x) = x^T A x$.

2.4 Write the second-order Taylor expansion (2.6) for the function $\cos(1/x)$ around
a nonzero point $x$, and the third-order Taylor expansion of $\cos(x)$ around any point $x$.
Evaluate the second expansion for the specific case of $x = 1$.

2.5 Consider the function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x) = \|x\|^2$. Show that the
sequence of iterates $\{x_k\}$ defined by

$$x_k = \left(1 + \frac{1}{2^k}\right) \begin{bmatrix} \cos k \\ \sin k \end{bmatrix}$$

satisfies $f(x_{k+1}) < f(x_k)$ for $k = 0, 1, 2, \ldots$. Show that every point on the unit circle
$\{x \mid \|x\|^2 = 1\}$ is a limit point for $\{x_k\}$. Hint: Every value $\theta \in [0, 2\pi]$ is a limit point of the
subsequence $\{\xi_k\}$ defined by

$$\xi_k = k \ (\mathrm{mod}\ 2\pi) = k - 2\pi \left\lfloor \frac{k}{2\pi} \right\rfloor,$$

where the operator $\lfloor \cdot \rfloor$ denotes rounding down to the next integer.

2.6 Prove that all isolated local minimizers are strict. (Hint: Take an isolated local
minimizer $x^*$ and a neighborhood $\mathcal{N}$. Show that for any $x \in \mathcal{N}$, $x \ne x^*$, we must have
$f(x) > f(x^*)$.)

2.7 Suppose that $f(x) = x^T Q x$, where $Q$ is an $n \times n$ symmetric positive semidefinite
matrix. Show using the definition (1.4) that $f(x)$ is convex on the domain $\mathbb{R}^n$. Hint: It may
be convenient to prove the following equivalent inequality:

$$f(y + \alpha(x - y)) - \alpha f(x) - (1 - \alpha) f(y) \le 0,$$

for all $\alpha \in [0, 1]$ and all $x, y \in \mathbb{R}^n$.

2.8 Suppose that $f$ is a convex function. Show that the set of global minimizers of $f$
is a convex set.

2.9 Consider the function $f(x_1, x_2) = (x_1 + x_2^2)^2$. At the point $x^T = (1, 0)$ we
consider the search direction $p^T = (-1, 1)$. Show that $p$ is a descent direction and find all
minimizers of the problem (2.10).

2.10 Suppose that $\tilde{f}(z) = f(x)$, where $x = Sz + s$ for some $S \in \mathbb{R}^{n \times n}$ and $s \in \mathbb{R}^n$.
Show that

$$\nabla \tilde{f}(z) = S^T \nabla f(x), \qquad \nabla^2 \tilde{f}(z) = S^T \nabla^2 f(x) S.$$

(Hint: Use the chain rule to express $\partial \tilde{f} / \partial z_j$ in terms of $\partial f / \partial x_i$ and $\partial x_i / \partial z_j$ for all
$i, j = 1, 2, \ldots, n$.)

2.11 Show that the symmetric rank-one update (2.18) and the BFGS update (2.19)
are scale-invariant if the initial Hessian approximations $B_0$ are chosen appropriately. That
is, using the notation of the previous exercise, show that if these methods are applied to
$f(x)$ starting from $x_0 = S z_0 + s$ with initial Hessian $B_0$, and to $\tilde{f}(z)$ starting from $z_0$ with
initial Hessian $S^T B_0 S$, then all iterates are related by $x_k = S z_k + s$. (Assume for simplicity
that the methods take unit step lengths.)

2.12 Suppose that a function $f$ of two variables is poorly scaled at the solution $x^*$.
Write two Taylor expansions of $f$ around $x^*$, one along each coordinate direction, and
use them to show that the Hessian $\nabla^2 f(x^*)$ is ill-conditioned.

2.13 (For this and the following three questions, refer to the material on “Rates of
Convergence” in Section A.2 of the Appendix.) Show that the sequence $x_k = 1/k$ is not
Q-linearly convergent, though it does converge to zero. (This is called sublinear convergence.)

2.14 Show that the sequence $x_k = 1 + (0.5)^{2^k}$ is Q-quadratically convergent to 1.

2.15 Does the sequence $x_k = 1/k!$ converge Q-superlinearly? Q-quadratically?

2.16 Consider the sequence $\{x_k\}$ defined by

$$x_k = \begin{cases} \left(\tfrac{1}{4}\right)^{2^k}, & k \text{ even}, \\ x_{k-1}/k, & k \text{ odd}. \end{cases}$$

Is this sequence Q-superlinearly convergent? Q-quadratically convergent? R-quadratically
convergent?

Chapter 3


Line Search Methods

Each iteration of a line search method computes a search direction $p_k$ and then decides how
far to move along that direction. The iteration is given by

$$x_{k+1} = x_k + \alpha_k p_k, \tag{3.1}$$

where the positive scalar $\alpha_k$ is called the step length. The success of a line search method
depends on effective choices of both the direction $p_k$ and the step length $\alpha_k$.

Most line search algorithms require $p_k$ to be a descent direction (one for which
$p_k^T \nabla f_k < 0$) because this property guarantees that the function $f$ can be reduced along
this direction, as discussed in the previous chapter. Moreover, the search direction often has
the form

$$p_k = -B_k^{-1} \nabla f_k, \tag{3.2}$$

where $B_k$ is a symmetric and nonsingular matrix. In the steepest descent method, $B_k$ is
simply the identity matrix $I$, while in Newton’s method, $B_k$ is the exact Hessian $\nabla^2 f(x_k)$.
In quasi-Newton methods, $B_k$ is an approximation to the Hessian that is updated at every
iteration by means of a low-rank formula. When $p_k$ is defined by (3.2) and $B_k$ is positive
definite, we have

$$p_k^T \nabla f_k = -\nabla f_k^T B_k^{-1} \nabla f_k < 0,$$

and therefore $p_k$ is a descent direction.

In this chapter, we discuss how to choose $\alpha_k$ and $p_k$ to promote convergence from
remote starting points. We also study the rate of convergence of steepest descent, quasi-Newton,
and Newton methods. Since the pure Newton iteration is not guaranteed to produce
descent directions when the current iterate is not close to a solution, we discuss modifications
in Section 3.4 that allow it to start from any initial point.

We now give careful consideration to the choice of the step-length parameter $\alpha_k$.


3.1  STEP  LENGTH 


In computing the step length $\alpha_k$, we face a tradeoff. We would like to choose $\alpha_k$ to give
a substantial reduction of $f$, but at the same time we do not want to spend too much
time making the choice. The ideal choice would be the global minimizer of the univariate
function $\phi(\cdot)$ defined by

$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha > 0, \tag{3.3}$$

but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local
minimizer of $\phi$ to moderate precision generally requires too many evaluations of the objective
function $f$ and possibly the gradient $\nabla f$. More practical strategies perform an inexact
line search to identify a step length that achieves adequate reductions in $f$ at minimal cost.

Typical  line  search  algorithms  try  out  a  sequence  of  candidate  values  for  a,  stopping  to 
accept  one  of  these  values  when  certain  conditions  are  satisfied.  The  line  search  is  done  in  two 
stages:  A  bracketing  phase  finds  an  interval  containing  desirable  step  lengths,  and  a  bisection 
or  interpolation  phase  computes  a  good  step  length  within  this  interval.  Sophisticated  line 
search  algorithms  can  be  quite  complicated,  so  we  defer  a  full  description  until  Section  3.5. 

We now discuss various termination conditions for line search algorithms and show
that effective step lengths need not lie near minimizers of the univariate function $\phi(\alpha)$
defined in (3.3).

A simple condition we could impose on $\alpha_k$ is to require a reduction in $f$, that is,
$f(x_k + \alpha_k p_k) < f(x_k)$. That this requirement is not enough to produce convergence to
$x^*$ is illustrated in Figure 3.2, for which the minimum function value is $f^* = -1$, but a
sequence of iterates $\{x_k\}$ for which $f(x_k) = 5/k$, $k = 1, 2, \ldots$ yields a decrease at each
iteration but has a limiting function value of zero. The insufficient reduction in $f$ at each
step causes it to fail to converge to the minimizer of this convex function. To avoid this
behavior we need to enforce a sufficient decrease condition, a concept we discuss next.


Figure 3.2 Insufficient reduction in $f$.

THE WOLFE CONDITIONS

A popular inexact line search condition stipulates that $\alpha_k$ should first of all give
sufficient decrease in the objective function $f$, as measured by the following inequality:

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f_k^T p_k, \tag{3.4}$$

for some constant $c_1 \in (0, 1)$. In other words, the reduction in $f$ should be proportional to
both the step length $\alpha_k$ and the directional derivative $\nabla f_k^T p_k$. Inequality (3.4) is sometimes
called the Armijo condition.

The sufficient decrease condition is illustrated in Figure 3.3. The right-hand-side of
(3.4), which is a linear function, can be denoted by $l(\alpha)$. The function $l(\cdot)$ has negative slope
$c_1 \nabla f_k^T p_k$, but because $c_1 \in (0, 1)$, it lies above the graph of $\phi$ for small positive values of
$\alpha$. The sufficient decrease condition states that $\alpha$ is acceptable only if $\phi(\alpha) \le l(\alpha)$. The
intervals on which this condition is satisfied are shown in Figure 3.3. In practice, $c_1$ is chosen
to be quite small, say $c_1 = 10^{-4}$.
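Inequality (3.4) translates directly into a small test. The sketch below is illustrative: the function name `sufficient_decrease`, the test function $f(x) = \|x\|^2$, and the chosen point and direction are assumptions, not part of the text.

```python
import numpy as np

def sufficient_decrease(f, grad_f, x, p, alpha, c1=1e-4):
    """Return True if alpha satisfies the sufficient decrease (Armijo) condition (3.4)."""
    return bool(f(x + alpha * p) <= f(x) + c1 * alpha * (grad_f(x) @ p))

def f(x):
    # Hypothetical test function (an assumption): f(x) = ||x||^2.
    return x @ x

def grad_f(x):
    return 2.0 * x

x = np.array([1.0, 0.0])
p = -grad_f(x)                 # steepest descent direction (-2, 0)

ok_half = sufficient_decrease(f, grad_f, x, p, 0.5)    # lands on the minimizer
ok_tiny = sufficient_decrease(f, grad_f, x, p, 1e-8)   # a very short step also passes
ok_full = sufficient_decrease(f, grad_f, x, p, 1.0)    # overshoots: f does not decrease
print(ok_half, ok_tiny, ok_full)
```

Note that the very short step passes the test, which is why this condition is not enough on its own to guarantee reasonable progress.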

The sufficient decrease condition is not enough by itself to ensure that the algorithm
makes reasonable progress because, as we see from Figure 3.3, it is satisfied for all sufficiently
small values of $\alpha$. To rule out unacceptably short steps we introduce a second requirement,
called the curvature condition, which requires $\alpha_k$ to satisfy

$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k, \tag{3.5}$$

for some constant $c_2 \in (c_1, 1)$, where $c_1$ is the constant from (3.4). Note that the left-hand-side
is simply the derivative $\phi'(\alpha_k)$, so the curvature condition ensures that the slope of $\phi$ at
$\alpha_k$ is greater than $c_2$ times the initial slope $\phi'(0)$. This makes sense because if the slope $\phi'(\alpha)$
is strongly negative, we have an indication that we can reduce $f$ significantly by moving
further along the chosen direction.

On the other hand, if $\phi'(\alpha_k)$ is only slightly negative or even positive, it is a sign that
we cannot expect much more decrease in $f$ in this direction, so it makes sense to terminate
the line search. The curvature condition is illustrated in Figure 3.4. Typical values of $c_2$ are
0.9 when the search direction $p_k$ is chosen by a Newton or quasi-Newton method, and 0.1
when $p_k$ is obtained from a nonlinear conjugate gradient method.


Figure 3.3 Sufficient decrease condition.

The sufficient decrease and curvature conditions are known collectively as the Wolfe
conditions. We illustrate them in Figure 3.5 and restate them here for future reference:

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \tag{3.6a}$$

$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k, \tag{3.6b}$$

with $0 < c_1 < c_2 < 1$.
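The two Wolfe conditions can be checked together as in the sketch below. The helper name `wolfe_conditions` and the test problem ($f(x) = \|x\|^2$ with a steepest descent direction) are assumptions made for illustration. For this particular problem the acceptable step lengths form the interval $[0.05, 0.9999]$ when $c_1 = 10^{-4}$ and $c_2 = 0.9$.

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return True if alpha satisfies both (3.6a) and (3.6b)."""
    slope0 = grad_f(x) @ p                                      # phi'(0)
    decrease = f(x + alpha * p) <= f(x) + c1 * alpha * slope0   # (3.6a)
    curvature = grad_f(x + alpha * p) @ p >= c2 * slope0        # (3.6b)
    return bool(decrease and curvature)

def f(x):
    # Hypothetical test function (an assumption): f(x) = ||x||^2.
    return x @ x

def grad_f(x):
    return 2.0 * x

x = np.array([1.0, 0.0])
p = -grad_f(x)

too_short = wolfe_conditions(f, grad_f, x, p, 0.01)   # fails the curvature condition
acceptable = wolfe_conditions(f, grad_f, x, p, 0.5)
too_long = wolfe_conditions(f, grad_f, x, p, 1.0)     # fails sufficient decrease
print(too_short, acceptable, too_long)
```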

A step length may satisfy the Wolfe conditions without being particularly close to a
minimizer of $\phi$, as we show in Figure 3.5. We can, however, modify the curvature condition
to force $\alpha_k$ to lie in at least a broad neighborhood of a local minimizer or stationary point
of $\phi$. The strong Wolfe conditions require $\alpha_k$ to satisfy

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \tag{3.7a}$$

$$|\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f_k^T p_k|, \tag{3.7b}$$

with $0 < c_1 < c_2 < 1$. The only difference with the Wolfe conditions is that we no longer
allow the derivative $\phi'(\alpha_k)$ to be too positive. Hence, we exclude points that are far from
stationary points of $\phi$.
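The difference between the two sets of conditions shows up at step lengths where the slope of $\phi$ has become strongly positive. The sketch below, again using the hypothetical test function $f(x) = \|x\|^2$ and an assumed helper name, finds a step length that satisfies the Wolfe conditions (3.6) but not the strong Wolfe conditions (3.7).

```python
import numpy as np

def strong_wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return True if alpha satisfies both (3.7a) and (3.7b)."""
    slope0 = grad_f(x) @ p
    decrease = f(x + alpha * p) <= f(x) + c1 * alpha * slope0            # (3.7a)
    curvature = abs(grad_f(x + alpha * p) @ p) <= c2 * abs(slope0)       # (3.7b)
    return bool(decrease and curvature)

def f(x):
    # Hypothetical test function (an assumption): f(x) = ||x||^2.
    return x @ x

def grad_f(x):
    return 2.0 * x

x = np.array([1.0, 0.0])
p = -grad_f(x)
alpha = 0.97                   # well past the minimizer of phi at alpha = 0.5

slope0 = grad_f(x) @ p
slope_alpha = grad_f(x + alpha * p) @ p      # phi'(alpha), positive here
weak_curvature = bool(slope_alpha >= 0.9 * slope0)

strong_at_half = strong_wolfe_conditions(f, grad_f, x, p, 0.5)
strong_at_alpha = strong_wolfe_conditions(f, grad_f, x, p, alpha)
print(slope_alpha, weak_curvature, strong_at_half, strong_at_alpha)
```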


Figure 3.5 Step lengths satisfying the Wolfe conditions.


It  is  not  difficult  to  prove  that  there  exist  step  lengths  that  satisfy  the  Wolfe  conditions 
for  every  function  /  that  is  smooth  and  bounded  below. 

Lemma 3.1.

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p_k$ be a descent direction at
$x_k$, and assume that $f$ is bounded below along the ray $\{x_k + \alpha p_k \mid \alpha > 0\}$. Then if $0 < c_1 <
c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe conditions (3.6) and the strong
Wolfe conditions (3.7).

Proof. Note that $\phi(\alpha) = f(x_k + \alpha p_k)$ is bounded below for all $\alpha > 0$. Since $0 < c_1 < 1$,
the line $l(\alpha) = f(x_k) + \alpha c_1 \nabla f_k^T p_k$ is unbounded below and must therefore intersect the
graph of $\phi$ at least once. Let $\alpha' > 0$ be the smallest intersecting value of $\alpha$, that is,

$$f(x_k + \alpha' p_k) = f(x_k) + \alpha' c_1 \nabla f_k^T p_k. \tag{3.8}$$

The sufficient decrease condition (3.6a) clearly holds for all step lengths less than $\alpha'$.

By the mean value theorem (see (A.55)), there exists $\alpha'' \in (0, \alpha')$ such that

$$f(x_k + \alpha' p_k) - f(x_k) = \alpha' \nabla f(x_k + \alpha'' p_k)^T p_k. \tag{3.9}$$

By combining (3.8) and (3.9), we obtain

$$\nabla f(x_k + \alpha'' p_k)^T p_k = c_1 \nabla f_k^T p_k > c_2 \nabla f_k^T p_k, \tag{3.10}$$

since $c_1 < c_2$ and $\nabla f_k^T p_k < 0$. Therefore, $\alpha''$ satisfies the Wolfe conditions (3.6), and the
inequalities hold strictly in both (3.6a) and (3.6b). Hence, by our smoothness assumption
on $f$, there is an interval around $\alpha''$ for which the Wolfe conditions hold. Moreover, since
the term in the left-hand side of (3.10) is negative, the strong Wolfe conditions (3.7) hold in
the same interval. □

The  Wolfe  conditions  are  scale-invariant  in  a  broad  sense:  Multiplying  the  objective 
function  by  a  constant  or  making  an  affine  change  of  variables  does  not  alter  them.  They  can 
be  used  in  most  line  search  methods,  and  are  particularly  important  in  the  implementation 
of  quasi-Newton  methods,  as  we  see  in  Chapter  6. 


THE  GOLDSTEIN  CONDITIONS 

Like the Wolfe conditions, the Goldstein conditions ensure that the step length $\alpha$
achieves sufficient decrease but is not too short. The Goldstein conditions can also be stated
as a pair of inequalities, in the following way:

$$f(x_k) + (1 - c) \alpha_k \nabla f_k^T p_k \le f(x_k + \alpha_k p_k) \le f(x_k) + c \alpha_k \nabla f_k^T p_k, \tag{3.11}$$

with $0 < c < 1/2$. The second inequality is the sufficient decrease condition (3.4), whereas
the first inequality is introduced to control the step length from below; see Figure 3.6.

A disadvantage of the Goldstein conditions vis-à-vis the Wolfe conditions is that the
first inequality in (3.11) may exclude all minimizers of $\phi$. However, the Goldstein and Wolfe
conditions have much in common, and their convergence theories are quite similar. The
Goldstein conditions are often used in Newton-type methods but are not well suited for
quasi-Newton methods that maintain a positive definite Hessian approximation.
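The pair of inequalities (3.11) can be tested as in the following sketch. The helper name and the test problem are assumptions. For the hypothetical function $f(x) = \|x\|^2$ with a steepest descent direction from $(1, 0)$ and $c = 0.25$, the acceptable step lengths work out to the interval $[0.25, 0.75]$.

```python
import numpy as np

def goldstein_conditions(f, grad_f, x, p, alpha, c=0.25):
    """Return True if alpha satisfies both inequalities in (3.11); requires 0 < c < 1/2."""
    fx = f(x)
    slope0 = grad_f(x) @ p
    f_new = f(x + alpha * p)
    lower = fx + (1.0 - c) * alpha * slope0    # controls the step length from below
    upper = fx + c * alpha * slope0            # sufficient decrease
    return bool(lower <= f_new <= upper)

def f(x):
    # Hypothetical test function (an assumption): f(x) = ||x||^2.
    return x @ x

def grad_f(x):
    return 2.0 * x

x = np.array([1.0, 0.0])
p = -grad_f(x)

too_short = goldstein_conditions(f, grad_f, x, p, 0.1)
acceptable = goldstein_conditions(f, grad_f, x, p, 0.5)
too_long = goldstein_conditions(f, grad_f, x, p, 0.9)
print(too_short, acceptable, too_long)
```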


Figure  3.6  The  Goldstein  conditions. 

SUFFICIENT DECREASE AND BACKTRACKING

We  have  mentioned  that  the  sufficient  decrease  condition  (3.6a)  alone  is  not  sufficient 
to  ensure  that  the  algorithm  makes  reasonable  progress  along  the  given  search  direction. 
However,  if  the  line  search  algorithm  chooses  its  candidate  step  lengths  appropriately,  by 
using  a  so-called  backtracking  approach,  we  can  dispense  with  the  extra  condition  (3.6b) 
and  use  just  the  sufficient  decrease  condition  to  terminate  the  line  search  procedure.  In  its 
most  basic  form,  backtracking  proceeds  as  follows. 

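Algorithm 3.1 (Backtracking Line Search).

  Choose $\bar\alpha > 0$, $\rho \in (0, 1)$, $c \in (0, 1)$; Set $\alpha \leftarrow \bar\alpha$;

  repeat until $f(x_k + \alpha p_k) \le f(x_k) + c \alpha \nabla f_k^T p_k$
    $\alpha \leftarrow \rho \alpha$;

  end (repeat)

  Terminate with $\alpha_k = \alpha$.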
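Algorithm 3.1 can be sketched in Python as follows. The function name, the default parameter values, the `max_iter` safeguard, and the check that `p` is a descent direction are additions made for this illustration and are not part of the algorithm as stated. The test function is hypothetical.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha_bar=1.0, rho=0.5, c=1e-4, max_iter=50):
    """Shrink alpha from alpha_bar by the factor rho until sufficient decrease holds."""
    fx = f(x)
    slope = grad_f(x) @ p
    if slope >= 0:
        raise ValueError("p is not a descent direction")
    alpha = alpha_bar
    for _ in range(max_iter):      # safeguard added for this sketch
        if f(x + alpha * p) <= fx + c * alpha * slope:
            return alpha
        alpha *= rho
    raise RuntimeError("no acceptable step length found within max_iter trials")

def f(x):
    # Hypothetical test function (an assumption): f(x) = ||x||^2.
    return x @ x

def grad_f(x):
    return 2.0 * x

x_k = np.array([1.0, 0.0])
p_k = -grad_f(x_k)

alpha_k = backtracking_line_search(f, grad_f, x_k, p_k)
x_next = x_k + alpha_k * p_k
print(alpha_k, x_next)
```

In this example the initial trial $\alpha = 1$ is rejected because it gives no decrease in $f$, and the next trial $\alpha = 0.5$ is accepted.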

In this procedure, the initial step length $\bar\alpha$ is chosen to be 1 in Newton and quasi-Newton
methods, but can have different values in other algorithms such as steepest descent
or conjugate gradient. An acceptable step length will be found after a finite number of
trials, because $\alpha_k$ will eventually become small enough that the sufficient decrease condition
holds (see Figure 3.3). In practice, the contraction factor $\rho$ is often allowed to vary at each
iteration of the line search. For example, it can be chosen by safeguarded interpolation, as
we describe later. We need ensure only that at each iteration we have $\rho \in [\rho_{\mathrm{lo}}, \rho_{\mathrm{hi}}]$, for some
fixed constants $0 < \rho_{\mathrm{lo}} < \rho_{\mathrm{hi}} < 1$.

The backtracking approach ensures either that the selected step length $\alpha_k$ is some fixed
value (the initial choice $\bar\alpha$), or else that it is short enough to satisfy the sufficient decrease
condition but not too short. The latter claim holds because the accepted value $\alpha_k$ is within
a factor $\rho$ of the previous trial value, $\alpha_k / \rho$, which was rejected for violating the sufficient
decrease condition, that is, for being too long.

This  simple  and  popular  strategy  for  terminating  a  line  search  is  well  suited  for  Newton 
methods  but  is  less  appropriate  for  quasi-Newton  and  conjugate  gradient  methods. 


3.2  CONVERGENCE  OF  LINE  SEARCH  METHODS 


To obtain global convergence, we must not only have well chosen step lengths but also well
chosen search directions $p_k$. We discuss requirements on the search direction in this section,
focusing on one key property: the angle $\theta_k$ between $p_k$ and the steepest descent direction
$-\nabla f_k$, defined by

$$\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\| \, \|p_k\|}. \tag{3.12}$$




The  following  theorem,  due  to  Zoutendijk,  has  far-reaching  consequences.  It  quantifies 
the effect of properly chosen step lengths $\alpha_k$, and shows, for example, that the steepest descent
method  is  globally  convergent.  For  other  algorithms,  it  describes  how  far  pk  can  deviate 
from  the  steepest  descent  direction  and  still  produce  a  globally  convergent  iteration.  Various 
line  search  termination  conditions  can  be  used  to  establish  this  result,  but  for  concreteness 
we  will  consider  only  the  Wolfe  conditions  (3.6).  Though  Zoutendijk’s  result  appears  at  first 
to  be  technical  and  obscure,  its  power  will  soon  become  evident. 

Theorem  3.2. 

Consider any iteration of the form (3.1), where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (3.6). Suppose that $f$ is bounded below in $\mathbb{R}^n$ and that $f$ is continuously differentiable in an open set $\mathcal{N}$ containing the level set $\mathcal{L} \stackrel{\rm def}{=} \{x : f(x) \le f(x_0)\}$, where $x_0$ is the starting point of the iteration. Assume also that the gradient $\nabla f$ is Lipschitz continuous on $\mathcal{N}$, that is, there exists a constant $L > 0$ such that

$$\|\nabla f(x) - \nabla f(\tilde x)\| \le L \|x - \tilde x\|, \quad \text{for all } x, \tilde x \in \mathcal{N}. \tag{3.13}$$

Then

$$\sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f_k\|^2 < \infty. \tag{3.14}$$

PROOF. From (3.6b) and (3.1) we have that

$$(\nabla f_{k+1} - \nabla f_k)^T p_k \ge (c_2 - 1)\, \nabla f_k^T p_k,$$

while the Lipschitz condition (3.13) implies that

$$(\nabla f_{k+1} - \nabla f_k)^T p_k \le \alpha_k L \|p_k\|^2.$$

By combining these two relations, we obtain

$$\alpha_k \ge \frac{c_2 - 1}{L}\, \frac{\nabla f_k^T p_k}{\|p_k\|^2}.$$

By substituting this inequality into the first Wolfe condition (3.6a), we obtain

$$f_{k+1} \le f_k - c_1 \frac{1 - c_2}{L}\, \frac{(\nabla f_k^T p_k)^2}{\|p_k\|^2}.$$

From the definition (3.12), we can write this relation as

$$f_{k+1} \le f_k - c \cos^2\theta_k \, \|\nabla f_k\|^2,$$

where $c = c_1(1 - c_2)/L$. By summing this expression over all indices less than or equal to $k$, we obtain

$$f_{k+1} \le f_0 - c \sum_{j=0}^{k} \cos^2\theta_j \, \|\nabla f_j\|^2. \tag{3.15}$$

Since $f$ is bounded below, we have that $f_0 - f_{k+1}$ is less than some positive constant, for all $k$. Hence, by taking limits in (3.15), we obtain

$$\sum_{k=0}^{\infty} \cos^2\theta_k \, \|\nabla f_k\|^2 < \infty,$$

which concludes the proof. □

Similar  results  to  this  theorem  hold  when  the  Goldstein  conditions  (3.11)  or  strong 
Wolfe  conditions  (3.7)  are  used  in  place  of  the  Wolfe  conditions.  For  all  these  strategies,  the 
step  length  selection  implies  inequality  (3.14),  which  we  call  the  Zoutendijk  condition. 

Note that the assumptions of Theorem 3.2 are not too restrictive. If the function $f$ were not bounded below, the optimization problem would not be well defined. The smoothness assumption (Lipschitz continuity of the gradient) is implied by many of the smoothness conditions that are used in local convergence theorems (see Chapters 6 and 7) and is often satisfied in practice.

The Zoutendijk condition (3.14) implies that

$$\cos^2\theta_k \, \|\nabla f_k\|^2 \to 0. \tag{3.16}$$

This limit can be used in turn to derive global convergence results for line search algorithms.

If our method for choosing the search direction $p_k$ in the iteration (3.1) ensures that the angle $\theta_k$ defined by (3.12) is bounded away from 90°, there is a positive constant $\delta$ such that

$$\cos\theta_k \ge \delta > 0, \quad \text{for all } k. \tag{3.17}$$

It follows immediately from (3.16) that

$$\lim_{k \to \infty} \|\nabla f_k\| = 0. \tag{3.18}$$

In other words, we can be sure that the gradient norms $\|\nabla f_k\|$ converge to zero, provided that the search directions are never too close to orthogonality with the gradient. In particular, the method of steepest descent (for which the search direction $p_k$ is parallel to the negative gradient) produces a gradient sequence that converges to zero, provided that it uses a line search satisfying the Wolfe or Goldstein conditions.

We  use  the  term  globally  convergent  to  refer  to  algorithms  for  which  the  property 
(3.18)  is  satisfied,  but  note  that  this  term  is  sometimes  used  in  other  contexts  to  mean 
different  things.  For  line  search  methods  of  the  general  form  (3.1),  the  limit  (3.18)  is  the 
strongest  global  convergence  result  that  can  be  obtained:  We  cannot  guarantee  that  the 
method  converges  to  a  minimizer,  but  only  that  it  is  attracted  by  stationary  points.  Only 
by making additional requirements on the search direction $p_k$ (by introducing negative
curvature information from the Hessian $\nabla^2 f(x_k)$, for example) can we strengthen these
results to include convergence to a local minimum. See the Notes and References at the end
of  this  chapter  for  further  discussion  of  this  point. 

Consider now the Newton-like method (3.1), (3.2) and assume that the matrices $B_k$ are positive definite with a uniformly bounded condition number. That is, there is a constant $M$ such that

$$\|B_k\| \, \|B_k^{-1}\| \le M, \quad \text{for all } k.$$

It is easy to show from the definition (3.12) that

$$\cos\theta_k \ge 1/M \tag{3.19}$$

(see Exercise 3.5). By combining this bound with (3.16) we find that

$$\lim_{k \to \infty} \|\nabla f_k\| = 0. \tag{3.20}$$

Therefore, we have shown that Newton and quasi-Newton methods are globally convergent if the matrices $B_k$ have a bounded condition number and are positive definite (which is needed to ensure that $p_k$ is a descent direction), and if the step lengths satisfy the Wolfe conditions.
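
The bound (3.19) can be checked numerically. The sketch below builds one random symmetric positive definite matrix (the construction `A @ A.T + 0.1 * I`, the seed, and the names `cos_theta`, `B`, `grad` are assumptions chosen for illustration) and compares $\cos\theta_k$ for $p_k = -B_k^{-1}\nabla f_k$ against $1/M$, taking $M$ to be the 2-norm condition number of $B_k$.

```python
import numpy as np

def cos_theta(grad, p):
    """Cosine of the angle between p and the steepest descent direction, as in (3.12)."""
    return float(-grad @ p / (np.linalg.norm(grad) * np.linalg.norm(p)))

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + 0.1 * np.eye(n)      # symmetric positive definite by construction
grad = rng.standard_normal(n)      # stand-in for the gradient at the current iterate

p = -np.linalg.solve(B, grad)      # Newton-like direction
M = np.linalg.cond(B)              # ||B|| ||B^{-1}|| in the 2-norm

print(cos_theta(grad, p), 1.0 / M)
```

A single random instance is of course not a proof; the argument requested in Exercise 3.5 is what establishes the bound in general.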

For some algorithms, such as conjugate gradient methods, we will not be able to prove the limit (3.18), but only the weaker result

$$\liminf_{k \to \infty} \|\nabla f_k\| = 0. \tag{3.21}$$

In other words, just a subsequence of the gradient norms $\|\nabla f_{k_j}\|$ converges to zero, rather than the whole sequence (see Appendix A). This result, too, can be proved by using Zoutendijk's condition (3.14), but instead of a constructive proof, we outline a proof by contradiction. Suppose that (3.21) does not hold, so that the gradients remain bounded away from zero, that is, there exists $\gamma > 0$ such that

$$\|\nabla f_k\| \ge \gamma, \quad \text{for all } k \text{ sufficiently large}. \tag{3.22}$$

Then from (3.16) we conclude that

$$\cos\theta_k \to 0, \tag{3.23}$$

that is, the entire sequence $\{\cos\theta_k\}$ converges to 0. To establish (3.21), therefore, it is enough to show that a subsequence $\{\cos\theta_{k_j}\}$ is bounded away from zero. We will use this strategy in Chapter 5 to study the convergence of nonlinear conjugate gradient methods.

By applying this proof technique, we can prove global convergence in the sense of (3.20) or (3.21) for a general class of algorithms. Consider any algorithm for which (i) every iteration produces a decrease in the objective function, and (ii) every $m$th iteration is a steepest descent step, with step length chosen to satisfy the Wolfe or Goldstein conditions. Then, since $\cos\theta_k = 1$ for the steepest descent steps, the result (3.21) holds. Of course, we would design the algorithm so that it does something "better" than steepest descent at the other $m - 1$ iterates. The occasional steepest descent steps may not make much progress, but they at least guarantee overall global convergence.

Note that throughout this section we have used only the fact that Zoutendijk's condition implies the limit (3.16). In later chapters we will make use of the bounded sum condition (3.14), which forces the sequence $\{\cos^2\theta_k \, \|\nabla f_k\|^2\}$ to converge to zero at a sufficiently rapid rate.


3.3  RATE  OF  CONVERGENCE 


It would seem that designing optimization algorithms with good convergence properties is easy, since all we need to ensure is that the search direction $p_k$ does not tend to become orthogonal to the gradient $\nabla f_k$, or that steepest descent steps are taken regularly. We could simply compute $\cos\theta_k$ at every iteration and turn $p_k$ toward the steepest descent direction if $\cos\theta_k$ is smaller than some preselected constant $\delta > 0$. Angle tests of this type ensure global convergence, but they are undesirable for two reasons. First, they may impede a fast rate of convergence, because for problems with an ill-conditioned Hessian, it may be necessary to produce search directions that are almost orthogonal to the gradient, and an inappropriate choice of the parameter $\delta$ may cause such steps to be rejected. Second, angle tests destroy the invariance properties of quasi-Newton methods.

Algorithmic  strategies  that  achieve  rapid  convergence  can  sometimes  conflict  with 
the  requirements  of  global  convergence,  and  vice  versa.  For  example,  the  steepest  descent 
method  is  the  quintessential  globally  convergent  algorithm,  but  it  is  quite  slow  in  practice, 
as  we  shall  see  below.  On  the  other  hand,  the  pure  Newton  iteration  converges  rapidly  when 
started  close  enough  to  a  solution,  but  its  steps  may  not  even  be  descent  directions  away 
from  the  solution.  The  challenge  is  to  design  algorithms  that  incorporate  both  properties: 
good  global  convergence  guarantees  and  a  rapid  rate  of  convergence. 

We  begin  our  study  of  convergence  rates  of  line  search  methods  by  considering  the 
most  basic  approach  of  all:  the  steepest  descent  method. 

CONVERGENCE RATE OF STEEPEST DESCENT

We  can  learn  much  about  the  steepest  descent  method  by  considering  the  ideal  case,  in 
which  the  objective  function  is  quadratic  and  the  line  searches  are  exact.  Let  us  suppose  that 

$$f(x) = \tfrac{1}{2} x^T Q x - b^T x, \tag{3.24}$$

where $Q$ is symmetric and positive definite. The gradient is given by $\nabla f(x) = Qx - b$ and the minimizer $x^*$ is the unique solution of the linear system $Qx = b$.

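It is easy to compute the step length $\alpha_k$ that minimizes $f(x_k - \alpha\nabla f_k)$. By differentiating the function

$$f(x_k - \alpha\nabla f_k) = \tfrac{1}{2} (x_k - \alpha\nabla f_k)^T Q (x_k - \alpha\nabla f_k) - b^T (x_k - \alpha\nabla f_k)$$

with respect to $\alpha$, and setting the derivative to zero, we obtain

$$\alpha_k = \frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T Q \nabla f_k}. \tag{3.25}$$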


If we use this exact minimizer $\alpha_k$, the steepest descent iteration for (3.24) is given by

$$x_{k+1} = x_k - \left( \frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T Q \nabla f_k} \right) \nabla f_k. \tag{3.26}$$

Since $\nabla f_k = Qx_k - b$, this equation yields a closed-form expression for $x_{k+1}$ in terms of $x_k$. In Figure 3.7 we plot a typical sequence of iterates generated by the steepest descent method on a two-dimensional quadratic objective function. The contours of $f$ are ellipsoids whose axes lie along the orthogonal eigenvectors of $Q$. Note that the iterates zigzag toward the solution.
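
The iteration (3.25), (3.26) is short enough to sketch directly. The function name `steepest_descent_quadratic`, the test matrix `Q = diag(1, 10)`, and the starting point are assumptions chosen for illustration; with $b = 0$ the minimizer is $x^* = 0$, so $x^T Q x$ is the squared error in the $Q$-weighted norm.

```python
import numpy as np

def steepest_descent_quadratic(Q, b, x0, num_iters):
    """Steepest descent with exact line search on f(x) = 0.5 x^T Q x - b^T x."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_iters):
        g = Q @ x - b                    # gradient of the quadratic
        gg = g @ g
        if gg == 0.0:                    # already at the minimizer
            break
        alpha = gg / (g @ (Q @ g))       # exact step length (3.25)
        x = x - alpha * g                # update (3.26)
        iterates.append(x.copy())
    return iterates

Q = np.diag([1.0, 10.0])                 # eigenvalues 1 and 10 (illustrative choice)
b = np.zeros(2)
xs = steepest_descent_quadratic(Q, b, [10.0, 1.0], 20)

# Ratio of successive squared errors in the Q-norm.
ratios = [(xs[k + 1] @ Q @ xs[k + 1]) / (xs[k] @ Q @ xs[k]) for k in range(len(xs) - 1)]
```

For this particular starting point every ratio equals $(9/11)^2 \approx 0.669$, and the iterates alternate between two directions, which is the zigzag pattern described above.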

To quantify the rate of convergence we introduce the weighted norm $\|x\|_Q^2 = x^T Q x$. By using the relation $Qx^* = b$, we can show that

$$\tfrac{1}{2} \|x - x^*\|_Q^2 = f(x) - f(x^*), \tag{3.27}$$

so this norm measures the difference between the current objective value and the optimal value. By using the equality (3.26) and noting that $\nabla f_k = Q(x_k - x^*)$, we can derive the equality

$$\|x_{k+1} - x^*\|_Q^2 = \left\{ 1 - \frac{(\nabla f_k^T \nabla f_k)^2}{(\nabla f_k^T Q \nabla f_k)(\nabla f_k^T Q^{-1} \nabla f_k)} \right\} \|x_k - x^*\|_Q^2 \tag{3.28}$$

(see Exercise 3.7). This expression describes the exact decrease in $f$ at each iteration, but since the term inside the brackets is difficult to interpret, it is more useful to bound it in terms of the condition number of the problem.


Theorem  3.3. 

When the steepest descent method with exact line searches (3.26) is applied to the strongly convex quadratic function (3.24), the error norm (3.27) satisfies

$$\|x_{k+1} - x^*\|_Q^2 \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 \|x_k - x^*\|_Q^2, \tag{3.29}$$

where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $Q$.

The proof of this result is given by Luenberger [195]. The inequalities (3.29) and (3.27) show that the function values $f_k$ converge to the minimum $f_*$ at a linear rate. As a special case of this result, we see that convergence is achieved in one iteration if all the eigenvalues are equal. In this case, $Q$ is a multiple of the identity matrix, so the contours in Figure 3.7 are circles and the steepest descent direction always points at the solution. In general, as the condition number $\kappa(Q) = \lambda_n/\lambda_1$ increases, the contours of the quadratic become more elongated, the zigzagging in Figure 3.7 becomes more pronounced, and (3.29) implies that the convergence degrades. Even though (3.29) is a worst-case bound, it gives an accurate indication of the behavior of the algorithm when $n > 2$.

The  rate-of-convergence  behavior  of  the  steepest  descent  method  is  essentially  the 
same  on  general  nonlinear  objective  functions.  In  the  following  result  we  assume  that  the 
step  length  is  the  global  minimizer  along  the  search  direction. 

Theorem  3.4. 

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable, and that the iterates generated by the steepest-descent method with exact line searches converge to a point $x^*$ at which the Hessian matrix $\nabla^2 f(x^*)$ is positive definite. Let $r$ be any scalar satisfying

$$r \in \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\, 1 \right),$$

where $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $\nabla^2 f(x^*)$. Then for all $k$ sufficiently large, we have

$$f(x_{k+1}) - f(x^*) \le r^2 \left[ f(x_k) - f(x^*) \right].$$

In general, we cannot expect the rate of convergence to improve if an inexact line search is used. Therefore, Theorem 3.4 shows that the steepest descent method can have an unacceptably slow rate of convergence, even when the Hessian is reasonably well conditioned. For example, if $\kappa(Q) = 800$, $f(x_1) = 1$, and $f(x^*) = 0$, Theorem 3.4 suggests that the function value will still be about 0.08 after five hundred iterations of the steepest descent method with exact line search, since each iteration reduces the bound only by the factor $r^2 \approx (799/801)^2$.
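
The slow decay implied by Theorem 3.4 is easy to tabulate. The helper below (the name `steepest_descent_bound` and its arguments are assumptions for this sketch) uses the limiting value $r = (\kappa - 1)/(\kappa + 1)$; the theorem only permits $r$ strictly larger than this, so the numbers should be read as the best case the bound allows.

```python
def steepest_descent_bound(kappa, num_iters, initial_gap=1.0):
    """Upper bound on f(x_k) - f(x*) after num_iters steps, using r = (kappa-1)/(kappa+1)."""
    r = (kappa - 1.0) / (kappa + 1.0)
    return initial_gap * r ** (2 * num_iters)   # each iteration contributes a factor r^2

bound_500 = steepest_descent_bound(800.0, 500)
bound_1000 = steepest_descent_bound(800.0, 1000)
print(bound_500, bound_1000)
```

With $\kappa = 800$ the per-iteration factor is $r^2 \approx 0.995$, so the bound shrinks by only about half a percent per step.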

NEWTON’S  METHOD 

We now consider the Newton iteration, for which the search direction is given by

$$p_k^N = -\nabla^2 f_k^{-1} \nabla f_k. \tag{3.30}$$

Since the Hessian matrix $\nabla^2 f_k$ may not always be positive definite, $p_k^N$ may not always be a descent direction, and many of the ideas discussed so far in this chapter no longer apply. In Section 3.4 and Chapter 4 we will describe two approaches for obtaining a globally convergent iteration based on the Newton step: a line search approach, in which the Hessian $\nabla^2 f_k$ is modified, if necessary, to make it positive definite and thereby yield descent, and a trust region approach, in which $\nabla^2 f_k$ is used to form a quadratic model that is minimized in a ball around the current iterate $x_k$.

Here we discuss just the local rate-of-convergence properties of Newton's method. We know that for all $x$ in the vicinity of a solution point $x^*$ such that $\nabla^2 f(x^*)$ is positive definite, the Hessian $\nabla^2 f(x)$ will also be positive definite. Newton's method will be well defined in this region and will converge quadratically, provided that the step lengths $\alpha_k$ are eventually always 1.

Theorem  3.5. 

Suppose that $f$ is twice differentiable and that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous (see (A.42)) in a neighborhood of a solution $x^*$ at which the sufficient conditions (Theorem 2.4) are satisfied. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ is given by (3.30). Then

(i) if the starting point $x_0$ is sufficiently close to $x^*$, the sequence of iterates converges to $x^*$;

(ii) the rate of convergence of $\{x_k\}$ is quadratic; and

(iii) the sequence of gradient norms $\{\|\nabla f_k\|\}$ converges quadratically to zero.
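
PROOF. From the definition of the Newton step and the optimality condition $\nabla f_* = 0$ we have that

$$x_k + p_k^N - x^* = x_k - x^* - \nabla^2 f_k^{-1} \nabla f_k = \nabla^2 f_k^{-1} \left[ \nabla^2 f_k (x_k - x^*) - (\nabla f_k - \nabla f_*) \right]. \tag{3.31}$$

Since Taylor's theorem (Theorem 2.1) tells us that

$$\nabla f_k - \nabla f_* = \int_0^1 \nabla^2 f(x_k + t(x^* - x_k))(x_k - x^*)\, dt,$$

we have

$$\begin{aligned}
\left\| \nabla^2 f(x_k)(x_k - x^*) - (\nabla f_k - \nabla f(x^*)) \right\|
 &= \left\| \int_0^1 \left[ \nabla^2 f(x_k) - \nabla^2 f(x_k + t(x^* - x_k)) \right] (x_k - x^*)\, dt \right\| \\
 &\le \int_0^1 \left\| \nabla^2 f(x_k) - \nabla^2 f(x_k + t(x^* - x_k)) \right\| \, \|x_k - x^*\|\, dt \\
 &\le \|x_k - x^*\|^2 \int_0^1 L t\, dt = \tfrac{1}{2} L \|x_k - x^*\|^2,
\end{aligned} \tag{3.32}$$

where $L$ is the Lipschitz constant for $\nabla^2 f(x)$ for $x$ near $x^*$. Since $\nabla^2 f(x^*)$ is nonsingular, there is a radius $r > 0$ such that $\|\nabla^2 f_k^{-1}\| \le 2 \|\nabla^2 f(x^*)^{-1}\|$ for all $x_k$ with $\|x_k - x^*\| \le r$. By substituting in (3.31) and (3.32), we obtain

$$\|x_k + p_k^N - x^*\| \le L \|\nabla^2 f(x^*)^{-1}\| \, \|x_k - x^*\|^2 = \tilde L \|x_k - x^*\|^2, \tag{3.33}$$

where $\tilde L = L \|\nabla^2 f(x^*)^{-1}\|$. Choosing $x_0$ so that $\|x_0 - x^*\| \le \min(r, 1/(2\tilde L))$, we can use this inequality inductively to deduce that the sequence converges to $x^*$, and the rate of convergence is quadratic.

By using the relations $x_{k+1} - x_k = p_k^N$ and $\nabla f_k + \nabla^2 f_k p_k^N = 0$, we obtain that

$$\begin{aligned}
\|\nabla f(x_{k+1})\| &= \left\| \nabla f(x_{k+1}) - \nabla f_k - \nabla^2 f(x_k) p_k^N \right\| \\
 &= \left\| \int_0^1 \nabla^2 f(x_k + t p_k^N)(x_{k+1} - x_k)\, dt - \nabla^2 f(x_k) p_k^N \right\| \\
 &\le \int_0^1 \left\| \nabla^2 f(x_k + t p_k^N) - \nabla^2 f(x_k) \right\| \, \|p_k^N\|\, dt \\
 &\le \tfrac{1}{2} L \|p_k^N\|^2 \\
 &\le \tfrac{1}{2} L \|\nabla^2 f(x_k)^{-1}\|^2 \, \|\nabla f_k\|^2 \\
 &\le 2 L \|\nabla^2 f(x^*)^{-1}\|^2 \, \|\nabla f_k\|^2,
\end{aligned}$$

proving that the gradient norms converge to zero quadratically. □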

As the iterates generated by Newton's method approach the solution, the Wolfe (or Goldstein) conditions will accept the step length $\alpha_k = 1$ for all large $k$. This observation follows from Theorem 3.6 below. Indeed, when the search direction is given by Newton's method, the limit (3.35) is satisfied: the ratio is zero for all $k$! Implementations of Newton's method using these line search conditions, and in which the line search always tries the unit step length first, will set $\alpha_k = 1$ for all large $k$ and attain a local quadratic rate of convergence.
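
Quadratic convergence with unit steps can be observed on a small example. The objective $f(x) = \sum_i (e^{x_i} - x_i)$, with minimizer $x^* = 0$ and Hessian $\mathrm{diag}(e^{x_i})$, is an assumption chosen for this sketch because its derivatives are available in closed form; the function names are likewise illustrative.

```python
import numpy as np

def grad(x):
    return np.expm1(x)               # gradient of sum(exp(x_i) - x_i), computed accurately near 0

def hess(x):
    return np.diag(np.exp(x))        # Hessian is diagonal and positive definite

def newton_errors(x0, num_iters):
    """Run the pure Newton iteration and record ||x_k - x*|| with x* = 0."""
    x = np.asarray(x0, dtype=float)
    errors = [float(np.linalg.norm(x))]
    for _ in range(num_iters):
        p = -np.linalg.solve(hess(x), grad(x))   # Newton step (3.30)
        x = x + p                                # unit step length
        errors.append(float(np.linalg.norm(x)))
    return errors

errors = newton_errors([0.5, -0.3], 4)
print(errors)
```

Each error is bounded by the square of the previous one in this run, so the number of correct digits roughly doubles per iteration until rounding error takes over.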


QUASI-NEWTON  METHODS 

Suppose now that the search direction has the form

$$p_k = -B_k^{-1} \nabla f_k, \tag{3.34}$$

where the symmetric and positive definite matrix $B_k$ is updated at every iteration by a quasi-Newton updating formula. We already encountered one quasi-Newton formula, the BFGS formula, in Chapter 2; others will be discussed in Chapter 6. We assume here that the step length $\alpha_k$ is computed by an inexact line search that satisfies the Wolfe or strong Wolfe conditions, with the same proviso mentioned above for Newton's method: The line search algorithm will always try the step length $\alpha = 1$ first, and will accept this value if it satisfies the Wolfe conditions. (We could enforce this condition by setting $\bar\alpha = 1$ in Algorithm 3.1, for example.) This implementation detail turns out to be crucial in obtaining a fast rate of convergence.

The  following  result  shows  that  if  the  search  direction  of  a  quasi-Newton  method 
approximates  the  Newton  direction  well  enough,  then  the  unit  step  length  will  satisfy  the 
Wolfe  conditions  as  the  iterates  converge  to  the  solution.  It  also  specifies  a  condition  that 
the  search  direction  must  satisfy  in  order  to  give  rise  to  a  superlinearly  convergent  iteration. 
To  bring  out  the  full  generality  of  this  result,  we  state  it  first  in  terms  of  a  general  descent 
iteration,  and  then  examine  its  consequences  for  quasi-Newton  and  Newton  methods. 


Theorem  3.6. 

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (3.6) with $c_1 \le 1/2$. If the sequence $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, and if the search direction satisfies

$$\lim_{k \to \infty} \frac{\|\nabla f_k + \nabla^2 f_k p_k\|}{\|p_k\|} = 0, \tag{3.35}$$

then

(i) the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$; and

(ii) if $\alpha_k = 1$ for all $k > k_0$, $\{x_k\}$ converges to $x^*$ superlinearly.

It is easy to see that if $c_1 > 1/2$, then the line search would exclude the minimizer of a quadratic, and unit step lengths may not be admissible.

If $p_k$ is a quasi-Newton search direction of the form (3.34), then (3.35) is equivalent to

$$\lim_{k \to \infty} \frac{\|(B_k - \nabla^2 f(x^*)) p_k\|}{\|p_k\|} = 0. \tag{3.36}$$

Hence, we have the surprising (and delightful) result that a superlinear convergence rate can be attained even if the sequence of quasi-Newton matrices $B_k$ does not converge to $\nabla^2 f(x^*)$; it suffices that the $B_k$ become increasingly accurate approximations to $\nabla^2 f(x^*)$ along the search directions $p_k$. Importantly, condition (3.36) is both necessary and sufficient for the superlinear convergence of quasi-Newton methods.
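
The equivalence of (3.35) and (3.36) for quasi-Newton directions rests on a short calculation, written out here as a sketch. It assumes only that $p_k$ has the form (3.34), that $x_k \to x^*$, and that $\nabla^2 f$ is continuous; the second line uses the reverse triangle inequality.

```latex
% Assumes p_k = -B_k^{-1} \nabla f_k as in (3.34), so that B_k p_k = -\nabla f_k.
\begin{align*}
\nabla f_k + \nabla^2 f_k\, p_k
  &= -B_k p_k + \nabla^2 f_k\, p_k
   = (\nabla^2 f_k - B_k)\, p_k, \\
\left| \frac{\|(B_k - \nabla^2 f_k)\, p_k\|}{\|p_k\|}
     - \frac{\|(B_k - \nabla^2 f(x^*))\, p_k\|}{\|p_k\|} \right|
  &\le \|\nabla^2 f_k - \nabla^2 f(x^*)\| \;\longrightarrow\; 0 .
\end{align*}
```

The first line shows that the ratio in (3.35) equals $\|(B_k - \nabla^2 f_k) p_k\| / \|p_k\|$; the second shows that replacing $\nabla^2 f_k$ by $\nabla^2 f(x^*)$ changes that ratio by a quantity that vanishes in the limit, so one ratio tends to zero exactly when the other does.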


Theorem  3.7. 

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + p_k$ (that is, the step length $\alpha_k$ is uniformly 1), where $p_k$ is given by (3.34). Let us assume also that $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite. Then $\{x_k\}$ converges superlinearly if and only if (3.36) holds.


PROOF. We first show that (3.36) is equivalent to

$$p_k - p_k^N = o(\|p_k\|), \tag{3.37}$$

where $p_k^N = -\nabla^2 f_k^{-1} \nabla f_k$ is the Newton step. Assuming that (3.36) holds, we have that

$$\begin{aligned}
p_k - p_k^N &= \nabla^2 f_k^{-1} (\nabla^2 f_k p_k + \nabla f_k) \\
 &= \nabla^2 f_k^{-1} (\nabla^2 f_k - B_k) p_k \\
 &= O(\|(\nabla^2 f_k - B_k) p_k\|) \\
 &= o(\|p_k\|),
\end{aligned}$$

where we have used the fact that $\|\nabla^2 f_k^{-1}\|$ is bounded above for $x_k$ sufficiently close to $x^*$, since the limiting Hessian $\nabla^2 f(x^*)$ is positive definite. The converse follows readily if we multiply both sides of (3.37) by $\nabla^2 f_k$ and recall (3.34).

By combining (3.33) and (3.37), we obtain that

$$\|x_k + p_k - x^*\| \le \|x_k + p_k^N - x^*\| + \|p_k - p_k^N\| = O(\|x_k - x^*\|^2) + o(\|p_k\|).$$

A simple manipulation of this inequality reveals that $\|p_k\| = O(\|x_k - x^*\|)$, so we obtain

$$\|x_k + p_k - x^*\| \le o(\|x_k - x^*\|),$$

giving the superlinear convergence result. □

We will see in Chapter 6 that quasi-Newton methods normally satisfy condition (3.36) and are therefore superlinearly convergent.


3.4  NEWTON'S  METHOD  WITH  HESSIAN  MODIFICATION 


Away from the solution, the Hessian matrix $\nabla^2 f(x)$ may not be positive definite, so the Newton direction $p_k^N$ defined by

$$\nabla^2 f(x_k)\, p_k^N = -\nabla f(x_k) \tag{3.38}$$

(see (3.30)) may not be a descent direction. We now describe an approach to overcome this difficulty when a direct linear algebra technique, such as Gaussian elimination, is used to solve the Newton equations (3.38). This approach obtains the step $p_k$ from a linear system identical to (3.38), except that the coefficient matrix is replaced with a positive definite approximation, formed before or during the solution process. The modified Hessian is obtained by adding either a positive diagonal matrix or a full matrix to the true Hessian $\nabla^2 f(x_k)$. A general description of this method follows.

Algorithm 3.2 (Line Search Newton with Modification).

Given initial point $x_0$;
for $k = 0, 1, 2, \ldots$
    Factorize the matrix $B_k = \nabla^2 f(x_k) + E_k$, where $E_k = 0$ if $\nabla^2 f(x_k)$
        is sufficiently positive definite; otherwise, $E_k$ is chosen to
        ensure that $B_k$ is sufficiently positive definite;
    Solve $B_k p_k = -\nabla f(x_k)$;
    Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe, Goldstein, or
        Armijo backtracking conditions;
end


Some  approaches  do  not  compute  Ek  explicitly,  but  rather  introduce  extra  steps  and 
tests  into  standard  factorization  procedures,  modifying  these  procedures  “on  the  fly”  so 
that  the  computed  factors  are  the  factors  of  a  positive  definite  matrix.  Strategies  based  on 
modifying  a  Cholesky  factorization  and  on  modifying  a  symmetric  indefinite  factorization 
of  the  Hessian  are  described  in  this  section. 

Algorithm 3.2 is a practical Newton method that can be applied from any starting point. We can establish fairly satisfactory global convergence results for it, provided that the strategy for choosing $E_k$ (and hence $B_k$) satisfies the bounded modified factorization property. This property is that the matrices in the sequence $\{B_k\}$ have bounded condition number whenever the sequence of Hessians $\{\nabla^2 f(x_k)\}$ is bounded; that is,

$$\kappa(B_k) = \|B_k\| \, \|B_k^{-1}\| \le C, \quad \text{some } C > 0 \text{ and all } k = 0, 1, 2, \ldots. \tag{3.39}$$

If this property holds, global convergence of the modified line search Newton method follows from the results of Section 3.2.

Theorem  3.8. 

Let $f$ be twice continuously differentiable on an open set $\mathcal{D}$, and assume that the starting point $x_0$ of Algorithm 3.2 is such that the level set $\mathcal{L} = \{x \in \mathcal{D} : f(x) \le f(x_0)\}$ is compact. Then if the bounded modified factorization property holds, we have that

$$\lim_{k \to \infty} \nabla f(x_k) = 0.$$

For a proof of this result see [215].

We now consider the convergence rate of Algorithm 3.2. Suppose that the sequence of iterates $x_k$ converges to a point $x^*$ where $\nabla^2 f(x^*)$ is sufficiently positive definite in the sense that the modification strategies described in the next section return the modification $E_k = 0$ for all sufficiently large $k$. By Theorem 3.6, we have that $\alpha_k = 1$ for all sufficiently large $k$, so that Algorithm 3.2 reduces to a pure Newton method, and the rate of convergence is quadratic.

For problems in which $\nabla^2 f^*$ is close to singular, there is no guarantee that the modification $E_k$ will eventually vanish, and the convergence rate may be only linear. Besides requiring the modified matrix $B_k$ to be well conditioned (so that Theorem 3.8 holds), we would like the modification to be as small as possible, so that the second-order information in the Hessian is preserved as far as possible. Naturally, we would also like the modified factorization to be computable at moderate cost.

To set the stage for the matrix factorization techniques that will be used in Algorithm 3.2, we will begin by assuming that the eigenvalue decomposition of $\nabla^2 f(x_k)$ is available. This is not realistic for large-scale problems because this decomposition is generally too expensive to compute, but it will motivate several practical modification strategies.

EIGENVALUE  MODIFICATION 

Consider a problem in which, at the current iterate $x_k$, $\nabla f(x_k) = (1, -3, 2)^T$ and $\nabla^2 f(x_k) = \mathrm{diag}(10, 3, -1)$, which is clearly indefinite. By the spectral decomposition theorem (see Appendix A) we can define $Q = I$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$, and write

$$\nabla^2 f(x_k) = Q \Lambda Q^T = \sum_{i=1}^{n} \lambda_i q_i q_i^T. \tag{3.40}$$

The pure Newton step, the solution of (3.38), is $p_k^N = (-0.1, 1, 2)^T$, which is not a descent direction, since $\nabla f(x_k)^T p_k^N > 0$. One might suggest a modified strategy in which we replace $\nabla^2 f(x_k)$ by a positive definite approximation $B_k$, in which all negative eigenvalues in $\nabla^2 f(x_k)$ are replaced by a small positive number $\delta$ that is somewhat larger than machine precision $\mathbf{u}$; say $\delta = \sqrt{\mathbf{u}}$. For a machine precision of $10^{-16}$, the resulting matrix in our example is

$$B_k = \sum_{i=1}^{2} \lambda_i q_i q_i^T + \delta q_3 q_3^T = \mathrm{diag}(10, 3, 10^{-8}), \tag{3.41}$$

which is numerically positive definite and whose curvature along the eigenvectors $q_1$ and $q_2$ has been preserved. Note, however, that the search direction based on this modified Hessian is

$$p_k = -B_k^{-1} \nabla f_k = -\sum_{i=1}^{2} \frac{1}{\lambda_i} q_i (q_i^T \nabla f_k) - \frac{1}{\delta} q_3 (q_3^T \nabla f(x_k)) \approx -(2 \times 10^8)\, q_3. \tag{3.42}$$

For small $\delta$, this step is nearly parallel to $q_3$ (with relatively small contributions from $q_1$ and $q_2$) and quite long. Although $f$ decreases along the direction $p_k$, its extreme length violates the spirit of Newton's method, which relies on a quadratic approximation of the objective function that is valid in a neighborhood of the current iterate $x_k$. It is therefore not clear that this search direction is effective.
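
The numbers in this example can be reproduced with a few lines of NumPy. This is a sketch under the assumption that `numpy.linalg.eigh` is used for the spectral decomposition (it returns eigenvalues in ascending order, so the eigenvector columns are a permutation of the $q_i$ in the text); the variable names are illustrative.

```python
import numpy as np

grad = np.array([1.0, -3.0, 2.0])          # gradient at the current iterate
H = np.diag([10.0, 3.0, -1.0])             # indefinite Hessian from the example

p_newton = -np.linalg.solve(H, grad)       # pure Newton step, solution of (3.38)
slope_newton = float(grad @ p_newton)      # positive, so not a descent direction

lam, Q = np.linalg.eigh(H)                 # spectral decomposition H = Q diag(lam) Q^T
delta = np.sqrt(1e-16)                     # delta = sqrt(u) for machine precision 1e-16
lam_mod = np.where(lam < 0.0, delta, lam)  # replace negative eigenvalues by delta
B = Q @ np.diag(lam_mod) @ Q.T             # modified Hessian, as in (3.41)

p_mod = -np.linalg.solve(B, grad)          # modified step, as in (3.42)
print(p_newton, slope_newton, p_mod)
```

The modified step is a descent direction, but its third component is about $-2 \times 10^8$, which dwarfs the other two and illustrates the excessive step length discussed above.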

Various other modification strategies are possible. We could flip the signs of the negative eigenvalues in (3.40), which amounts to setting $\delta = 1$ in our example. We could set the last term in (3.42) to zero, so that the search direction has no components along the negative curvature directions. We could adapt the choice of $\delta$ to ensure that the length of the step is not excessive, a strategy that has the flavor of trust-region methods. As this discussion shows, there is a great deal of freedom in devising modification strategies, and there is currently no agreement on which strategy is best.

Setting the issue of the choice of $\delta$ aside for the moment, let us look more closely at the process of modifying a matrix so that it becomes positive definite. The modification (3.41) to the example matrix (3.40) can be shown to be optimal in the following sense. If $A$ is a symmetric matrix with spectral decomposition $A = Q \Lambda Q^T$, then the correction matrix $\Delta A$ of minimum Frobenius norm that ensures that $\lambda_{\min}(A + \Delta A) \ge \delta$ is given by

$$\Delta A = Q\, \mathrm{diag}(\tau_i)\, Q^T, \quad \text{with} \quad \tau_i = \begin{cases} 0, & \lambda_i \ge \delta, \\ \delta - \lambda_i, & \lambda_i < \delta. \end{cases} \tag{3.43}$$

Here, $\lambda_{\min}(A)$ denotes the smallest eigenvalue of $A$, and the Frobenius norm of a matrix is defined as $\|A\|_F^2 = \sum_{i,j=1}^{n} a_{ij}^2$ (see (A.9)). Note that $\Delta A$ is not diagonal in general, and that the modified matrix is given by

$$A + \Delta A = Q (\Lambda + \mathrm{diag}(\tau_i)) Q^T.$$

By using a different norm we can obtain a diagonal modification. Suppose again that
$A$ is a symmetric matrix with spectral decomposition $A = Q \Lambda Q^T$. A correction matrix
$\Delta A$ with minimum Euclidean norm that satisfies $\lambda_{\min}(A + \Delta A) \ge \delta$ is given by

$$\Delta A = \tau I, \quad \text{with} \quad \tau = \max\left(0,\, \delta - \lambda_{\min}(A)\right). \tag{3.44}$$

The modified matrix now has the form

$$A + \tau I, \tag{3.45}$$

which happens to have the same form as the matrix occurring in (unscaled) trust-region
methods (see Chapter 4). All the eigenvalues of (3.45) have thus been shifted, and all are
greater than or equal to $\delta$.
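
To compare the two corrections, the following Python sketch evaluates (3.43) and (3.44) from the eigenvalues alone. This is legitimate because both the Frobenius and the Euclidean norm are unchanged by the orthogonal matrix $Q$. The eigenvalues $10$, $3$, $-1$ and the function names are assumptions made for illustration.

```python
import math

def eigenvalue_corrections(eigs, delta):
    """tau_i of (3.43): raise each eigenvalue below delta up to delta."""
    return [0.0 if lam >= delta else delta - lam for lam in eigs]

def identity_shift(eigs, delta):
    """tau of (3.44): one shift applied to every eigenvalue."""
    return max(0.0, delta - min(eigs))

eigs = [10.0, 3.0, -1.0]   # eigenvalues of the example Hessian (assumed data)
delta = 1e-8

taus = eigenvalue_corrections(eigs, delta)
tau = identity_shift(eigs, delta)

# Norms of the two corrections, evaluated in the eigenbasis.
frobenius_343 = math.sqrt(sum(t * t for t in taus))
frobenius_344 = math.sqrt(len(eigs)) * tau
euclidean_343 = max(abs(t) for t in taus)
euclidean_344 = tau

print("eigenvalues after (3.43):", [lam + t for lam, t in zip(eigs, taus)])
print("eigenvalues after (3.44):", [lam + tau for lam in eigs])
print("Frobenius norms:", frobenius_343, frobenius_344)
print("Euclidean norms:", euclidean_343, euclidean_344)
```

In this example the two corrections have the same Euclidean norm, but the multiple of the identity has the larger Frobenius norm because it also shifts the eigenvalues that were already acceptable.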

These results suggest that both diagonal and nondiagonal modifications can be considered.
Even though we have not answered the question of what constitutes a good
modification, various practical diagonal and nondiagonal modifications have been proposed
and implemented in software. They do not make use of the spectral decomposition of
the Hessian, since it is generally too expensive to compute. Instead, they use Gaussian
elimination, choosing the modifications indirectly and hoping that somehow they will produce
good steps. Numerical experience indicates that the strategies described next often (but not
always) produce good search directions.

ADDING  A  MULTIPLE  OF  THE  IDENTITY 

Perhaps the simplest idea is to find a scalar $\tau > 0$ such that $\nabla^2 f(x_k) + \tau I$ is sufficiently
positive definite. From the previous discussion we know that $\tau$ must satisfy (3.44), but a good
estimate of the smallest eigenvalue of the Hessian is normally not available. The following
algorithm describes a method that tries successively larger values of $\tau$. (Here, $a_{ii}$ denotes a
diagonal element of $A$.)

Algorithm 3.3 (Cholesky with Added Multiple of the Identity).
Choose $\beta > 0$;
if $\min_i a_{ii} > 0$
  set $\tau_0 \leftarrow 0$;
else
  $\tau_0 = -\min(a_{ii}) + \beta$;
end (if)
for $k = 0, 1, 2, \ldots$
  Attempt to apply the Cholesky algorithm to obtain $LL^T = A + \tau_k I$;
  if the factorization is completed successfully
    stop and return $L$;
  else
    $\tau_{k+1} \leftarrow \max(2\tau_k, \beta)$;
  end (if)
end (for)
The choice of $\beta$ is heuristic; a typical value is $\beta = 10^{-3}$. We could choose the first
nonzero shift $\tau_0$ to be proportional to the final value of $\tau$ used in the latest Hessian
modification; see also Algorithm B.1. The strategy implemented in Algorithm 3.3 is quite
simple and may be preferable to the modified factorization techniques described next, but
it suffers from one drawback. Every value of $\tau_k$ requires a new factorization of $A + \tau_k I$, and
the algorithm can be quite expensive if several trial values are generated. Therefore it may
be advantageous to increase $\tau$ more rapidly, say by a factor of 10 instead of 2 in the last else
clause.
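
A minimal Python sketch of Algorithm 3.3 follows. It uses plain lists and a textbook Cholesky routine that reports failure when a pivot is not positive. The helper names, the `growth` argument (2 by default, as in the algorithm), and the cap `max_tries` are assumptions added for illustration.

```python
import math

def try_cholesky(a):
    """Return lower-triangular L with L L^T = a, or None if a pivot is not positive."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(L[j][s] ** 2 for s in range(j))
        if pivot <= 0.0:
            return None
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][s] * L[j][s] for s in range(j))) / L[j][j]
    return L

def cholesky_added_identity(a, beta=1e-3, growth=2.0, max_tries=100):
    """Find tau >= 0 such that a + tau*I has a Cholesky factor; return (L, tau)."""
    n = len(a)
    min_diag = min(a[i][i] for i in range(n))
    tau = 0.0 if min_diag > 0 else -min_diag + beta
    for _ in range(max_tries):
        shifted = [[a[i][j] + (tau if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        L = try_cholesky(shifted)
        if L is not None:
            return L, tau
        tau = max(growth * tau, beta)   # last else clause of Algorithm 3.3
    raise RuntimeError("no positive definite shift found")

A = [[1.0, 2.0], [2.0, 1.0]]   # indefinite: eigenvalues 3 and -1
L, tau = cholesky_added_identity(A)
print("accepted shift:", tau)
```

For this matrix every shift up to $\tau = 1$ fails, so the doubling strategy needs about a dozen factorizations before it succeeds, which illustrates the cost mentioned above.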


MODIFIED  CHOLESKY  FACTORIZATION 

Another approach for modifying a Hessian matrix that is not positive definite is
to perform a Cholesky factorization of $\nabla^2 f(x_k)$, but to increase the diagonal elements
encountered during the factorization (where necessary) to ensure that they are sufficiently
positive. This modified Cholesky approach is designed to accomplish two goals: It guarantees
that the modified Cholesky factors exist and are bounded relative to the norm of the actual
Hessian, and it does not modify the Hessian if it is sufficiently positive definite.

We  begin  our  description  of  this  approach  by  briefly  reviewing  the  Cholesky 
factorization.  Every  symmetric  positive  definite  matrix  A  can  be  written  as 

$$A = LDL^T, \tag{3.46}$$

where  L  is  a  lower  triangular  matrix  with  unit  diagonal  elements  and  D  is  a  diagonal  matrix 
with  positive  elements  on  the  diagonal.  By  equating  the  elements  in  (3.46),  column  by 
column,  it  is  easy  to  derive  formulas  for  computing  L  and  D. 


□  Example  3.1 


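Consider the case $n = 3$. The equation $A = LDL^T$ is given by

$$\begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{21} & a_{22} & a_{32} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{bmatrix} \begin{bmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{bmatrix} \begin{bmatrix} 1 & l_{21} & l_{31} \\ 0 & 1 & l_{32} \\ 0 & 0 & 1 \end{bmatrix}.$$

(The notation indicates that $A$ is symmetric.) By equating the elements of the first column,
we have

$$\begin{aligned} a_{11} &= d_1, \\ a_{21} &= d_1 l_{21} \;\Rightarrow\; l_{21} = a_{21}/d_1, \\ a_{31} &= d_1 l_{31} \;\Rightarrow\; l_{31} = a_{31}/d_1. \end{aligned}$$

Proceeding with the next two columns, we obtain

$$\begin{aligned} a_{22} &= d_1 l_{21}^2 + d_2 &&\Rightarrow\; d_2 = a_{22} - d_1 l_{21}^2, \\ a_{32} &= d_1 l_{31} l_{21} + d_2 l_{32} &&\Rightarrow\; l_{32} = \left(a_{32} - d_1 l_{31} l_{21}\right)/d_2, \\ a_{33} &= d_1 l_{31}^2 + d_2 l_{32}^2 + d_3 &&\Rightarrow\; d_3 = a_{33} - d_1 l_{31}^2 - d_2 l_{32}^2. \end{aligned}$$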


□ 


This  procedure  is  generalized  in  the  following  algorithm. 


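Algorithm 3.4 (Cholesky Factorization, $LDL^T$ Form).
for $j = 1, 2, \ldots, n$
  $c_{jj} \leftarrow a_{jj} - \sum_{s=1}^{j-1} d_s l_{js}^2$;
  $d_j \leftarrow c_{jj}$;
  for $i = j+1, \ldots, n$
    $c_{ij} \leftarrow a_{ij} - \sum_{s=1}^{j-1} d_s l_{is} l_{js}$;
    $l_{ij} \leftarrow c_{ij} / d_j$;
  end
end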


One can show (see, for example, Golub and Van Loan [136, Section 4.2.3]) that the
diagonal elements $d_j$ of $D$ are all positive whenever $A$ is positive definite. The scalars $c_{ij}$ have
been introduced only to facilitate the description of the modified factorization discussed
below. We should note that Algorithm 3.4 differs a little from the standard form of the
Cholesky factorization, which produces a lower triangular matrix $M$ such that

$$A = MM^T. \tag{3.47}$$

In fact, we can make the identification $M = LD^{1/2}$ to relate $M$ to the factors $L$ and $D$
computed in Algorithm 3.4. The technique for computing $M$ appears as Algorithm A.2 in
Appendix A.
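
The factorization in $LDL^T$ form translates almost line by line into code. The following Python sketch uses zero-based indices and plain lists; the test matrix is an arbitrary positive definite example chosen by us, and the last line forms $M = LD^{1/2}$ as in the identification above.

```python
import math

def ldlt(a):
    """LDL^T factorization of a symmetric positive definite matrix (list of rows)."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        c_jj = a[j][j] - sum(d[s] * L[j][s] ** 2 for s in range(j))
        d[j] = c_jj
        for i in range(j + 1, n):
            c_ij = a[i][j] - sum(d[s] * L[i][s] * L[j][s] for s in range(j))
            L[i][j] = c_ij / d[j]
    return L, d

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, d = ldlt(A)

# Standard Cholesky factor M = L D^{1/2}, so that A = M M^T.
M = [[L[i][j] * math.sqrt(d[j]) for j in range(3)] for i in range(3)]

print("L =", L)
print("d =", d)
print("M =", M)
```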

If $A$ is indefinite, the factorization $A = LDL^T$ may not exist. Even if it does exist,
Algorithm 3.4 is numerically unstable when applied to such matrices, in the sense that the
elements of $L$ and $D$ can become arbitrarily large. It follows that a strategy of computing
the $LDL^T$ factorization and then modifying the diagonal after the fact to force its elements
to be positive may break down, or may result in a matrix that is drastically different from $A$.

Instead, we can modify the matrix $A$ during the course of the factorization in such
a way that all elements in $D$ are sufficiently positive, and so that the elements of $D$ and
$L$ are not too large. To control the quality of the modification, we choose two positive
parameters $\delta$ and $\beta$, and require that during the computation of the $j$th columns of $L$ and
$D$ in Algorithm 3.4 (that is, for each $j$ in the outer loop of the algorithm) the following
bounds be satisfied:

$$d_j \ge \delta, \qquad |m_{ij}| \le \beta, \quad i = j+1, j+2, \ldots, n, \tag{3.48}$$

where $m_{ij} = l_{ij} \sqrt{d_j}$. To satisfy these bounds we only need to change one step in
Algorithm 3.4: The formula for computing the diagonal element $d_j$ in Algorithm 3.4 is replaced
by

$$d_j = \max\left(|c_{jj}|,\; \left(\frac{\theta_j}{\beta}\right)^2,\; \delta\right), \quad \text{with} \quad \theta_j = \max_{j < i \le n} |c_{ij}|. \tag{3.49}$$

To verify that (3.48) holds, we note from Algorithm 3.4 that $c_{ij} = l_{ij} d_j$, and therefore

$$|m_{ij}| = \left|l_{ij} \sqrt{d_j}\right| = \frac{|c_{ij}|}{\sqrt{d_j}} \le \frac{|c_{ij}|\, \beta}{\theta_j} \le \beta, \quad \text{for all } i > j.$$

We note that $\theta_j$ can be computed prior to $d_j$ because the elements $c_{ij}$ in the second
for loop of Algorithm 3.4 do not involve $d_j$. In fact, this is the reason for introducing the
quantities $c_{ij}$ into the algorithm.

These observations are the basis of the modified Cholesky algorithm described in detail
in Gill, Murray, and Wright [130], which introduces symmetric interchanges of rows and
columns to try to reduce the size of the modification. If $P$ denotes the permutation matrix
associated with the row and column interchanges, the algorithm produces the Cholesky
factorization of the permuted, modified matrix $PAP^T + E$, that is,

$$PAP^T + E = LDL^T = MM^T, \tag{3.50}$$

where $E$ is a nonnegative diagonal matrix that is zero if $A$ is sufficiently positive definite.
One can show (Moré and Sorensen [215]) that the matrices $B_k$ obtained by applying this modified
Cholesky algorithm to the exact Hessians $\nabla^2 f(x_k)$ have bounded condition numbers, that
is, the bound (3.39) holds for some value of $C$.
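
The effect of replacing the diagonal formula by (3.49) can be seen in a small experiment. The Python sketch below is a simplified illustration, not the algorithm of Gill, Murray, and Wright: it omits the symmetric interchanges (so $P = I$), and the default values of `delta` and `beta` are arbitrary assumptions.

```python
def modified_ldlt(a, delta=1e-8, beta=10.0):
    """LDL^T factorization with the diagonal rule (3.49); no pivoting."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        c_jj = a[j][j] - sum(d[s] * L[j][s] ** 2 for s in range(j))
        # The c_ij do not involve d_j, so theta_j is available before d_j.
        c_col = [a[i][j] - sum(d[s] * L[i][s] * L[j][s] for s in range(j))
                 for i in range(j + 1, n)]
        theta = max((abs(c) for c in c_col), default=0.0)
        d[j] = max(abs(c_jj), (theta / beta) ** 2, delta)
        for offset, c in enumerate(c_col):
            L[j + 1 + offset][j] = c / d[j]
    return L, d

A = [[1.0, 2.0], [2.0, 1.0]]   # indefinite: eigenvalues 3 and -1
L, d = modified_ldlt(A)
n = len(A)

# Reconstruct A + E = L D L^T and read off the diagonal modification E.
modified = [[sum(L[i][s] * d[s] * L[j][s] for s in range(n)) for j in range(n)]
            for i in range(n)]
E = [[modified[i][j] - A[i][j] for j in range(n)] for i in range(n)]

print("d =", d)
print("A + E =", modified)
print("E =", E)
```

Only the diagonal of $A$ changes: the off-diagonal entries are reproduced exactly because $l_{ij}$ is still defined as $c_{ij}/d_j$.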

MODIFIED  SYMMETRIC  INDEFINITE  FACTORIZATION 

Another strategy for modifying an indefinite Hessian is to use a procedure based on
a symmetric indefinite factorization. Any symmetric matrix $A$, whether positive definite or
not, can be written as

$$PAP^T = LBL^T, \tag{3.51}$$

where $L$ is unit lower triangular, $B$ is a block diagonal matrix with blocks of dimension 1
or 2, and $P$ is a permutation matrix (see our discussion in Appendix A and also Golub and
Van Loan [136, Section 4.4]). We mentioned earlier that attempting to compute the $LDL^T$
factorization of an indefinite matrix (where $D$ is a diagonal matrix) is inadvisable because
even if the factors $L$ and $D$ are well defined, they may contain entries that are larger than the
original elements of $A$, thus amplifying rounding errors that arise during the computation.
However, by using the block diagonal matrix $B$, which allows $2 \times 2$ blocks as well as $1 \times 1$
blocks on the diagonal, we can guarantee that the factorization (3.51) always exists and can
be computed by a numerically stable process.


□  Example  3.2 

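The matrix

$$A = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 2 & 2 \\ 2 & 2 & 3 & 3 \\ 3 & 2 & 3 & 4 \end{bmatrix}$$

can be written in the form (3.51) with $P = [e_1, e_4, e_3, e_2]$,

$$L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \frac{1}{9} & \frac{2}{3} & 1 & 0 \\ \frac{2}{9} & \frac{1}{3} & 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 3 & 0 & 0 \\ 3 & 4 & 0 & 0 \\ 0 & 0 & \frac{7}{9} & \frac{5}{9} \\ 0 & 0 & \frac{5}{9} & \frac{10}{9} \end{bmatrix}.$$

Note that both diagonal blocks in $B$ are $2 \times 2$. Several algorithms for computing symmetric
indefinite factorizations are discussed in Section A.1 of Appendix A.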


The symmetric indefinite factorization allows us to determine the inertia of a matrix,
that is, the number of positive, zero, and negative eigenvalues. One can show that the inertia
of $B$ equals the inertia of $A$. Moreover, the $2 \times 2$ blocks in $B$ are always constructed to
have one positive and one negative eigenvalue. Thus the number of positive eigenvalues in
$A$ equals the number of positive $1 \times 1$ blocks plus the number of $2 \times 2$ blocks.

As for the Cholesky factorization, an indefinite symmetric factorization algorithm
can be modified to ensure that the modified factors are the factors of a positive definite
matrix. The strategy is first to compute the factorization (3.51), as well as the spectral
decomposition $B = Q \Lambda Q^T$, which is inexpensive to compute because $B$ is block diagonal
(see Exercise 3.12). We then construct a modification matrix $F$ such that

$$L(B + F)L^T$$

is sufficiently positive definite. Motivated by the modified spectral decomposition (3.43),
we choose a parameter $\delta > 0$ and define $F$ to be

$$F = Q\, \mathrm{diag}(\tau_i)\, Q^T, \qquad \tau_i = \begin{cases} 0, & \lambda_i \ge \delta, \\ \delta - \lambda_i, & \lambda_i < \delta, \end{cases} \quad i = 1, 2, \ldots, n, \tag{3.53}$$

where $\lambda_i$ are the eigenvalues of $B$. The matrix $F$ is thus the modification of minimum
Frobenius norm that ensures that all eigenvalues of the modified matrix $B + F$ are no less
than $\delta$. This strategy therefore modifies the factorization (3.51) as follows:

$$P(A + E)P^T = L(B + F)L^T, \quad \text{where} \quad E = P^T L F L^T P.$$

(Note that $E$ will not be diagonal, in general.) Hence, in contrast to the modified Cholesky
approach, this modification strategy changes the entire matrix $A$, not just its diagonal. The
aim of strategy (3.53) is that the modified matrix satisfies $\lambda_{\min}(A + E) \approx \delta$ whenever the
original matrix $A$ has $\lambda_{\min}(A) < \delta$. It is not clear, however, whether it always comes close
to attaining this goal.
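
Because $B$ is block diagonal, the spectral decomposition needed in (3.53) can be formed one block at a time, and for a $2 \times 2$ symmetric block it is available in closed form. The Python sketch below applies (3.53) to the two diagonal blocks of $B$ from Example 3.2. The value $\delta = 0.1$ and the helper names are assumptions made for illustration.

```python
import math

def sym2_eig(a, b, c):
    """Eigenvalues (ascending) and orthonormal eigenvectors of [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    if radius == 0.0:                      # block is a multiple of the identity
        return [mean, mean], [(1.0, 0.0), (0.0, 1.0)]
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # rotation that diagonalizes the block
    v_hi = (math.cos(theta), math.sin(theta))
    v_lo = (-math.sin(theta), math.cos(theta))
    return [mean - radius, mean + radius], [v_lo, v_hi]

def modify_block(a, b, c, delta):
    """Return the block plus its part of F from (3.53)."""
    lams, vecs = sym2_eig(a, b, c)
    out = [[a, b], [b, c]]
    for lam, (v0, v1) in zip(lams, vecs):
        tau = 0.0 if lam >= delta else delta - lam
        out[0][0] += tau * v0 * v0
        out[0][1] += tau * v0 * v1
        out[1][0] += tau * v1 * v0
        out[1][1] += tau * v1 * v1
    return out

# Diagonal blocks of B in Example 3.2, stored as (b11, b12, b22).
blocks = [(0.0, 3.0, 4.0), (7.0 / 9.0, 5.0 / 9.0, 10.0 / 9.0)]
delta = 0.1
modified_blocks = [modify_block(a, b, c, delta) for a, b, c in blocks]

for (a, b, c), m in zip(blocks, modified_blocks):
    before = sym2_eig(a, b, c)[0]
    after = sym2_eig(m[0][0], m[0][1], m[1][1])[0]
    print("eigenvalues before:", before, "after:", after)
```

Only the first block has an eigenvalue below $\delta$, so only that block is changed; its negative eigenvalue is moved to $\delta$ while its positive eigenvalue is left alone.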


3.5  STEP-LENGTH  SELECTION  ALGORITHMS 


We now consider techniques for finding a minimum of the one-dimensional function

$$\phi(\alpha) = f(x_k + \alpha p_k), \tag{3.54}$$

or for simply finding a step length $\alpha_k$ satisfying one of the termination conditions described
in Section 3.1. We assume that $p_k$ is a descent direction, that is, $\phi'(0) < 0$, so that our
search can be confined to positive values of $\alpha$.

If $f$ is a convex quadratic, $f(x) = \frac{1}{2} x^T Q x - b^T x$, its one-dimensional minimizer
along the ray $x_k + \alpha p_k$ can be computed analytically and is given by

$$\alpha_k = -\frac{\nabla f_k^T p_k}{p_k^T Q p_k}. \tag{3.55}$$


For  general  nonlinear  functions,  it  is  necessary  to  use  an  iterative  procedure.  The  line  search 
procedure  deserves  particular  attention  because  it  has  a  major  impact  on  the  robustness  and 
efficiency  of  all  nonlinear  optimization  methods. 
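
For the quadratic case, formula (3.55) can be evaluated directly. The Python sketch below does so for a small diagonal $Q$ chosen by us and then checks that the directional derivative vanishes at the new point; the function name `exact_step` is our own.

```python
def exact_step(Q, b, x, p):
    """Step length (3.55) for f(x) = 0.5 x^T Q x - b^T x along direction p."""
    n = len(x)
    grad = [sum(Q[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    Qp = [sum(Q[i][j] * p[j] for j in range(n)) for i in range(n)]
    slope = sum(g * pi for g, pi in zip(grad, p))          # grad f_k^T p_k
    curvature = sum(pi * qpi for pi, qpi in zip(p, Qp))    # p_k^T Q p_k
    return -slope / curvature

Q = [[2.0, 0.0], [0.0, 4.0]]
b = [0.0, 0.0]
x = [1.0, 1.0]
p = [-2.0, -4.0]   # steepest descent direction -grad f(x)

alpha = exact_step(Q, b, x, p)
x_new = [xi + alpha * pi for xi, pi in zip(x, p)]

# At the exact minimizer along p the new gradient is orthogonal to p.
grad_new = [sum(Q[i][j] * x_new[j] for j in range(2)) - b[i] for i in range(2)]
slope_new = sum(g * pi for g, pi in zip(grad_new, p))

print("alpha =", alpha)
print("directional derivative at new point =", slope_new)
```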
Line search procedures can be classified according to the type of derivative information
they use. Algorithms that use only function values can be inefficient since, to be theoretically
sound, they need to continue iterating until the search for the minimizer is narrowed down
to a small interval. In contrast, knowledge of gradient information allows us to determine
whether a suitable step length has been located, as stipulated, for example, by the Wolfe
conditions (3.6) or Goldstein conditions (3.11). Often, particularly when $x_k$ is close to the
solution, the very first choice of $\alpha$ satisfies these conditions, so the line search need not
be invoked at all. In the rest of this section, we discuss only algorithms that make use of
derivative information. More information on derivative-free procedures is given in the notes
at the end of this chapter.

All line search procedures require an initial estimate $\alpha_0$ and generate a sequence $\{\alpha_i\}$
that either terminates with a step length satisfying the conditions specified by the user (for
example, the Wolfe conditions) or determines that such a step length does not exist. Typical
procedures consist of two phases: a bracketing phase that finds an interval $[\bar{a}, \bar{b}]$ containing
acceptable step lengths, and a selection phase that zooms in to locate the final step length.
The selection phase usually reduces the bracketing interval during its search for the desired
step length and interpolates some of the function and derivative information gathered on
earlier steps to guess the location of the minimizer. We first discuss how to perform this
interpolation.

In the following discussion we let $\alpha_k$ and $\alpha_{k-1}$ denote the step lengths used at iterations
$k$ and $k-1$ of the optimization algorithm, respectively. On the other hand, we denote the
trial step lengths generated during the line search by $\alpha_i$ and $\alpha_{i-1}$ and also $\alpha_j$. We use $\alpha_0$ to
denote the initial guess.

INTERPOLATION 

We begin by describing a line search procedure based on interpolation of known
function and derivative values of the function $\phi$. This procedure can be viewed as an
enhancement of Algorithm 3.1. The aim is to find a value of $\alpha$ that satisfies the sufficient
decrease condition (3.6a), without being "too small." Accordingly, the procedures here
generate a decreasing sequence of values $\alpha_i$ such that each value $\alpha_i$ is not too much smaller
than its predecessor $\alpha_{i-1}$.

Note that we can write the sufficient decrease condition in the notation of (3.54) as

$$\phi(\alpha_k) \le \phi(0) + c_1 \alpha_k \phi'(0), \tag{3.56}$$

and that since the constant $c_1$ is usually chosen to be small in practice ($c_1 = 10^{-4}$, say), this
condition asks for little more than descent in $f$. We design the procedure to be "efficient"
in the sense that it computes the derivative $\nabla f(x)$ as few times as possible.

Suppose that the initial guess $\alpha_0$ is given. If we have

$$\phi(\alpha_0) \le \phi(0) + c_1 \alpha_0 \phi'(0),$$

this step length satisfies the condition, and we terminate the search. Otherwise, we know that
the interval $[0, \alpha_0]$ contains acceptable step lengths (see Figure 3.3). We form a quadratic
approximation $\phi_q(\alpha)$ to $\phi$ by interpolating the three pieces of information available, namely $\phi(0)$,
$\phi'(0)$, and $\phi(\alpha_0)$, to obtain

$$\phi_q(\alpha) = \left(\frac{\phi(\alpha_0) - \phi(0) - \alpha_0 \phi'(0)}{\alpha_0^2}\right) \alpha^2 + \phi'(0)\alpha + \phi(0). \tag{3.57}$$

(Note that this function is constructed so that it satisfies the interpolation conditions
$\phi_q(0) = \phi(0)$, $\phi_q'(0) = \phi'(0)$, and $\phi_q(\alpha_0) = \phi(\alpha_0)$.) The new trial value $\alpha_1$ is defined as the
minimizer of this quadratic, that is, we obtain

$$\alpha_1 = -\frac{\phi'(0)\alpha_0^2}{2\left[\phi(\alpha_0) - \phi(0) - \phi'(0)\alpha_0\right]}. \tag{3.58}$$


If the sufficient decrease condition (3.56) is satisfied at $\alpha_1$, we terminate the search.
Otherwise, we construct a cubic function that interpolates the four pieces of information $\phi(0)$,
$\phi'(0)$, $\phi(\alpha_0)$, and $\phi(\alpha_1)$, obtaining

$$\phi_c(\alpha) = a\alpha^3 + b\alpha^2 + \alpha \phi'(0) + \phi(0),$$

where

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_0^2 \alpha_1^2 (\alpha_1 - \alpha_0)} \begin{bmatrix} \alpha_0^2 & -\alpha_1^2 \\ -\alpha_0^3 & \alpha_1^3 \end{bmatrix} \begin{bmatrix} \phi(\alpha_1) - \phi(0) - \phi'(0)\alpha_1 \\ \phi(\alpha_0) - \phi(0) - \phi'(0)\alpha_0 \end{bmatrix}.$$

By differentiating $\phi_c(\alpha)$, we see that the minimizer $\alpha_2$ of $\phi_c$ lies in the interval $[0, \alpha_1]$ and is
given by

$$\alpha_2 = \frac{-b + \sqrt{b^2 - 3a\phi'(0)}}{3a}.$$

If necessary, this process is repeated, using a cubic interpolant of $\phi(0)$, $\phi'(0)$ and the two
most recent values of $\phi$, until an $\alpha$ that satisfies (3.56) is located. If any $\alpha_i$ is either too
close to its predecessor $\alpha_{i-1}$ or else too much smaller than $\alpha_{i-1}$, we reset $\alpha_i = \alpha_{i-1}/2$. This
safeguard procedure ensures that we make reasonable progress on each iteration and that
the final $\alpha$ is not too small.
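
The procedure just described can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions: the safeguard thresholds (a trial value is kept only if it lies between 0.1 and 0.9 times its predecessor), the cap `max_iter`, and the test function are our own choices, not values prescribed by the text.

```python
import math

def interpolation_search(phi, dphi0, alpha0=1.0, c1=1e-4, max_iter=50):
    """Backtrack from alpha0 until (3.56) holds, using (3.58) and then cubic steps."""
    phi0 = phi(0.0)

    def sufficient_decrease(alpha, value):
        return value <= phi0 + c1 * alpha * dphi0

    def safeguard(candidate, previous):
        # Assumed thresholds: halve the step unless the candidate is a finite
        # number in [0.1 * previous, 0.9 * previous].
        if not (0.1 * previous <= candidate <= 0.9 * previous):
            return 0.5 * previous
        return candidate

    a_prev, f_prev = alpha0, phi(alpha0)
    if sufficient_decrease(a_prev, f_prev):
        return a_prev

    # Quadratic interpolation (3.58).
    a_cur = -dphi0 * a_prev ** 2 / (2.0 * (f_prev - phi0 - dphi0 * a_prev))
    a_cur = safeguard(a_cur, a_prev)
    f_cur = phi(a_cur)

    for _ in range(max_iter):
        if sufficient_decrease(a_cur, f_cur):
            return a_cur
        # Cubic through phi(0), phi'(0), and the two most recent values.
        r_cur = f_cur - phi0 - dphi0 * a_cur
        r_prev = f_prev - phi0 - dphi0 * a_prev
        denom = a_prev ** 2 * a_cur ** 2 * (a_cur - a_prev)
        a = (a_prev ** 2 * r_cur - a_cur ** 2 * r_prev) / denom
        b = (-a_prev ** 3 * r_cur + a_cur ** 3 * r_prev) / denom
        disc = b * b - 3.0 * a * dphi0
        if a != 0.0 and disc >= 0.0:
            a_new = (-b + math.sqrt(disc)) / (3.0 * a)
        else:
            a_new = 0.5 * a_cur
        a_new = safeguard(a_new, a_cur)
        a_prev, f_prev = a_cur, f_cur
        a_cur, f_cur = a_new, phi(a_new)
    raise RuntimeError("no acceptable step length found")

phi = lambda alpha: -alpha + 10.0 * alpha ** 4   # phi'(0) = -1
alpha = interpolation_search(phi, dphi0=-1.0)
print("accepted step length:", alpha)
```

Note that only one derivative value, $\phi'(0)$, is used; every other piece of data is a function value, in keeping with the aim of evaluating $\nabla f$ as rarely as possible.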

The strategy just described assumes that derivative values are significantly more
expensive to compute than function values. It is often possible, however, to compute the
directional derivative simultaneously with the function, at little additional cost; see
Chapter 8. Accordingly, we can design an alternative strategy based on cubic interpolation of the
values of $\phi$ and $\phi'$ at the two most recent values of $\alpha$.
Cubic interpolation provides a good model for functions with significant changes of
curvature. Suppose we have an interval $[\bar{a}, \bar{b}]$ known to contain desirable step lengths, and
two previous step length estimates $\alpha_{i-1}$ and $\alpha_i$ in this interval. We use a cubic function to
interpolate $\phi(\alpha_{i-1})$, $\phi'(\alpha_{i-1})$, $\phi(\alpha_i)$, and $\phi'(\alpha_i)$. (This cubic function always exists and is
unique; see, for example, Bulirsch and Stoer [41, p. 52].) The minimizer of this cubic in
$[\bar{a}, \bar{b}]$ is either at one of the endpoints or else in the interior, in which case it is given by

$$\alpha_{i+1} = \alpha_i - (\alpha_i - \alpha_{i-1}) \left[ \frac{\phi'(\alpha_i) + d_2 - d_1}{\phi'(\alpha_i) - \phi'(\alpha_{i-1}) + 2 d_2} \right], \tag{3.59}$$

with

$$d_1 = \phi'(\alpha_{i-1}) + \phi'(\alpha_i) - 3\, \frac{\phi(\alpha_{i-1}) - \phi(\alpha_i)}{\alpha_{i-1} - \alpha_i},$$

$$d_2 = \mathrm{sign}(\alpha_i - \alpha_{i-1}) \left[ d_1^2 - \phi'(\alpha_{i-1})\, \phi'(\alpha_i) \right]^{1/2}.$$

The interpolation process can be repeated by discarding the data at one of the step
lengths $\alpha_{i-1}$ or $\alpha_i$ and replacing it by $\phi(\alpha_{i+1})$ and $\phi'(\alpha_{i+1})$. The decision on which of $\alpha_{i-1}$
and $\alpha_i$ should be kept and which discarded depends on the specific conditions used to
terminate the line search; we discuss this issue further below in the context of the Wolfe
conditions. Cubic interpolation is a powerful strategy, since it usually produces a quadratic
rate of convergence of the iteration (3.59) to the minimizing value of $\alpha$.
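
Formula (3.59) is short enough to code directly. In the Python sketch below the function name and the convention of returning `None` when the formula does not apply (negative radicand or zero denominator) are our assumptions. The check uses $\phi(\alpha) = \alpha^3 - 3\alpha$, which is itself a cubic, so the interpolant reproduces it and the formula should return its local minimizer $\alpha = 1$.

```python
import math

def cubic_minimizer(a_prev, f_prev, g_prev, a_cur, f_cur, g_cur):
    """Interior minimizer (3.59) of the cubic interpolating phi and phi' at two points."""
    d1 = g_prev + g_cur - 3.0 * (f_prev - f_cur) / (a_prev - a_cur)
    radicand = d1 * d1 - g_prev * g_cur
    if radicand < 0.0:
        return None   # the cubic has no interior minimizer
    d2 = math.copysign(math.sqrt(radicand), a_cur - a_prev)
    denom = g_cur - g_prev + 2.0 * d2
    if denom == 0.0:
        return None
    return a_cur - (a_cur - a_prev) * (g_cur + d2 - d1) / denom

phi = lambda a: a ** 3 - 3.0 * a
dphi = lambda a: 3.0 * a ** 2 - 3.0

# Interpolate at alpha = 0 and alpha = 2, in both orders.
forward = cubic_minimizer(0.0, phi(0.0), dphi(0.0), 2.0, phi(2.0), dphi(2.0))
backward = cubic_minimizer(2.0, phi(2.0), dphi(2.0), 0.0, phi(0.0), dphi(0.0))
print(forward, backward)
```

A complete line search would still compare the value returned here with the endpoints of the bracketing interval, as noted above.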


INITIAL  STEP  LENGTH 

For Newton and quasi-Newton methods, the step $\alpha_0 = 1$ should always be used as
the initial trial step length. This choice ensures that unit step lengths are taken whenever
they satisfy the termination conditions and allows the rapid rate-of-convergence properties
of these methods to take effect.

For methods that do not produce well scaled search directions, such as the steepest
descent and conjugate gradient methods, it is important to use current information about the
problem and the algorithm to make the initial guess. A popular strategy is to assume that the
first-order change in the function at iterate $x_k$ will be the same as that obtained at the
previous step. In other words, we choose the initial guess $\alpha_0$ so that $\alpha_0 \nabla f_k^T p_k = \alpha_{k-1} \nabla f_{k-1}^T p_{k-1}$,
that is,

$$\alpha_0 = \alpha_{k-1} \frac{\nabla f_{k-1}^T p_{k-1}}{\nabla f_k^T p_k}.$$

Another useful strategy is to interpolate a quadratic to the data $f(x_{k-1})$, $f(x_k)$, and
$\nabla f_{k-1}^T p_{k-1}$ and to define $\alpha_0$ to be its minimizer. This strategy yields

$$\alpha_0 = \frac{2 (f_k - f_{k-1})}{\phi'(0)}. \tag{3.60}$$

It can be shown that if $x_k \to x^*$ superlinearly, then the ratio in this expression converges to
1. If we adjust the choice (3.60) by setting

$$\alpha_0 \leftarrow \min(1,\, 1.01\alpha_0),$$

we find that the unit step length $\alpha_0 = 1$ will eventually always be tried and accepted, and the
superlinear convergence properties of Newton and quasi-Newton methods will be observed.
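
Both heuristics for the initial trial step amount to one line of arithmetic each. The Python sketch below is an illustration; the function and argument names are ours, `slope_prev` and `slope_cur` stand for the directional derivatives $\nabla f_{k-1}^T p_{k-1}$ and $\nabla f_k^T p_k$ (both negative for descent directions), and the numbers in the usage lines are made up.

```python
def initial_step_first_order(alpha_prev, slope_prev, slope_cur):
    """Match the first-order change obtained at the previous iteration."""
    return alpha_prev * slope_prev / slope_cur

def initial_step_quadratic(f_cur, f_prev, slope_cur):
    """Formula (3.60), followed by the adjustment min(1, 1.01 * alpha_0)."""
    alpha0 = 2.0 * (f_cur - f_prev) / slope_cur
    return min(1.0, 1.01 * alpha0)

# Made-up iteration data: the function decreased from 1.0 to 0.9 and the
# current directional derivative is -1.0.
print(initial_step_first_order(alpha_prev=0.5, slope_prev=-2.0, slope_cur=-1.0))
print(initial_step_quadratic(f_cur=0.9, f_prev=1.0, slope_cur=-1.0))
print(initial_step_quadratic(f_cur=0.5, f_prev=1.0, slope_cur=-1.0))
```

In the last call the raw value of (3.60) is exactly 1, so the adjusted rule returns the unit step.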


A  LINE  SEARCH  ALGORITHM  FOR  THE  WOLFE  CONDITIONS 

The Wolfe (or strong Wolfe) conditions are among the most widely applicable and
useful termination conditions. We now describe in some detail a one-dimensional search
procedure that is guaranteed to find a step length satisfying the strong Wolfe conditions (3.7)
for any parameters $c_1$ and $c_2$ satisfying $0 < c_1 < c_2 < 1$. As before, we assume that $p$ is a
descent direction and that $f$ is bounded below along the direction $p$.

The algorithm has two stages. The first stage begins with a trial estimate $\alpha_1$, and keeps
increasing it until it finds either an acceptable step length or an interval that brackets the
desired step lengths. In the latter case, the second stage is invoked by calling a function called
zoom (Algorithm 3.6, below), which successively decreases the size of the interval until an
acceptable step length is identified.

A formal specification of the line search algorithm follows. We refer to (3.7a) as the
sufficient decrease condition and to (3.7b) as the curvature condition. The parameter $\alpha_{\max}$
is a user-supplied bound on the maximum step length allowed. The line search algorithm
terminates with $\alpha_*$ set to a step length that satisfies the strong Wolfe conditions.

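Algorithm 3.5 (Line Search Algorithm).
Set $\alpha_0 \leftarrow 0$, choose $\alpha_{\max} > 0$ and $\alpha_1 \in (0, \alpha_{\max})$;
$i \leftarrow 1$;
repeat
  Evaluate $\phi(\alpha_i)$;
  if $\phi(\alpha_i) > \phi(0) + c_1 \alpha_i \phi'(0)$ or [$\phi(\alpha_i) \ge \phi(\alpha_{i-1})$ and $i > 1$]
    $\alpha_* \leftarrow$ zoom($\alpha_{i-1}$, $\alpha_i$) and stop;
  Evaluate $\phi'(\alpha_i)$;
  if $|\phi'(\alpha_i)| \le -c_2 \phi'(0)$
    set $\alpha_* \leftarrow \alpha_i$ and stop;
  if $\phi'(\alpha_i) \ge 0$
    set $\alpha_* \leftarrow$ zoom($\alpha_i$, $\alpha_{i-1}$) and stop;
  Choose $\alpha_{i+1} \in (\alpha_i, \alpha_{\max})$;
  $i \leftarrow i + 1$;
end (repeat)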
Note that the sequence of trial step lengths $\{\alpha_i\}$ is monotonically increasing, but that
the order of the arguments supplied to the zoom function may vary. The procedure uses
the knowledge that the interval $(\alpha_{i-1}, \alpha_i)$ contains step lengths satisfying the strong Wolfe
conditions if one of the following three conditions is satisfied:

(i) $\alpha_i$ violates the sufficient decrease condition;

(ii) $\phi(\alpha_i) \ge \phi(\alpha_{i-1})$;

(iii) $\phi'(\alpha_i) \ge 0$.

The last step of the algorithm performs extrapolation to find the next trial value $\alpha_{i+1}$. To
implement this step we can use approaches like the interpolation procedures above, or
we can simply set $\alpha_{i+1}$ to some constant multiple of $\alpha_i$. Whichever strategy we use, it is
important that the successive steps increase quickly enough to reach the upper limit $\alpha_{\max}$ in
a finite number of iterations.

We now specify the function zoom, which requires a little explanation. The order of
its input arguments is such that each call has the form zoom($\alpha_{\mathrm{lo}}$, $\alpha_{\mathrm{hi}}$), where

(a) the interval bounded by $\alpha_{\mathrm{lo}}$ and $\alpha_{\mathrm{hi}}$ contains step lengths that satisfy the strong Wolfe
conditions;

(b) $\alpha_{\mathrm{lo}}$ is, among all step lengths generated so far and satisfying the sufficient decrease
condition, the one giving the smallest function value; and

(c) $\alpha_{\mathrm{hi}}$ is chosen so that $\phi'(\alpha_{\mathrm{lo}})(\alpha_{\mathrm{hi}} - \alpha_{\mathrm{lo}}) < 0$.

Each iteration of zoom generates an iterate $\alpha_j$ between $\alpha_{\mathrm{lo}}$ and $\alpha_{\mathrm{hi}}$, and then replaces one
of these endpoints by $\alpha_j$ in such a way that the properties (a), (b), and (c) continue to hold.

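Algorithm 3.6 (zoom).
repeat
  Interpolate (using quadratic, cubic, or bisection) to find
    a trial step length $\alpha_j$ between $\alpha_{\mathrm{lo}}$ and $\alpha_{\mathrm{hi}}$;
  Evaluate $\phi(\alpha_j)$;
  if $\phi(\alpha_j) > \phi(0) + c_1 \alpha_j \phi'(0)$ or $\phi(\alpha_j) \ge \phi(\alpha_{\mathrm{lo}})$
    $\alpha_{\mathrm{hi}} \leftarrow \alpha_j$;
  else
    Evaluate $\phi'(\alpha_j)$;
    if $|\phi'(\alpha_j)| \le -c_2 \phi'(0)$
      Set $\alpha_* \leftarrow \alpha_j$ and stop;
    if $\phi'(\alpha_j)(\alpha_{\mathrm{hi}} - \alpha_{\mathrm{lo}}) \ge 0$
      $\alpha_{\mathrm{hi}} \leftarrow \alpha_{\mathrm{lo}}$;
    $\alpha_{\mathrm{lo}} \leftarrow \alpha_j$;
end (repeat)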




If the new estimate $\alpha_j$ happens to satisfy the strong Wolfe conditions, then zoom has served
its purpose of identifying such a point, so it terminates with $\alpha_* = \alpha_j$. Otherwise, if $\alpha_j$
satisfies the sufficient decrease condition and has a lower function value than $\alpha_{\mathrm{lo}}$, then we
set $\alpha_{\mathrm{lo}} \leftarrow \alpha_j$ to maintain condition (b). If this setting results in a violation of condition (c),
we remedy the situation by setting $\alpha_{\mathrm{hi}}$ to the old value of $\alpha_{\mathrm{lo}}$. Readers should sketch some
graphs to see for themselves how zoom works!
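
The two-stage line search and its zoom phase can be put together in a short Python sketch. It is only an illustration under simplifying assumptions: zoom uses bisection rather than polynomial interpolation, the bracketing phase doubles the trial step (capped at `alpha_max`), both loops are limited by `max_iter`, and $\phi(\alpha_{\mathrm{lo}})$ is recomputed rather than cached. All names are ours.

```python
def zoom(phi, dphi, a_lo, a_hi, phi0, dphi0, c1, c2, max_iter=50):
    """Shrink the bracket until a step satisfies the strong Wolfe conditions."""
    for _ in range(max_iter):
        a_j = 0.5 * (a_lo + a_hi)          # bisection in place of interpolation
        f_j = phi(a_j)
        if f_j > phi0 + c1 * a_j * dphi0 or f_j >= phi(a_lo):
            a_hi = a_j
        else:
            g_j = dphi(a_j)
            if abs(g_j) <= -c2 * dphi0:
                return a_j
            if g_j * (a_hi - a_lo) >= 0.0:
                a_hi = a_lo
            a_lo = a_j
    raise RuntimeError("zoom did not converge")

def strong_wolfe_search(phi, dphi, alpha1=1.0, alpha_max=100.0,
                        c1=1e-4, c2=0.9, max_iter=50):
    """Bracketing phase; calls zoom once a bracket has been found."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    a_prev, f_prev = 0.0, phi0
    a_cur = alpha1
    for i in range(1, max_iter + 1):
        f_cur = phi(a_cur)
        if f_cur > phi0 + c1 * a_cur * dphi0 or (i > 1 and f_cur >= f_prev):
            return zoom(phi, dphi, a_prev, a_cur, phi0, dphi0, c1, c2)
        g_cur = dphi(a_cur)
        if abs(g_cur) <= -c2 * dphi0:
            return a_cur
        if g_cur >= 0.0:
            return zoom(phi, dphi, a_cur, a_prev, phi0, dphi0, c1, c2)
        a_prev, f_prev = a_cur, f_cur
        a_cur = min(2.0 * a_cur, alpha_max)   # simple extrapolation rule
    raise RuntimeError("no acceptable step length found")

phi = lambda a: (a - 3.0) ** 2
dphi = lambda a: 2.0 * (a - 3.0)
alpha_star = strong_wolfe_search(phi, dphi, c2=0.1)
print("accepted step length:", alpha_star)
```

With the tight value $c_2 = 0.1$, the trial steps 1 and 2 are rejected by the curvature condition, the step 4 produces a bracket, and zoom then locates a step near the minimizer.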

As mentioned earlier, the interpolation step that determines $\alpha_j$ should be safeguarded
to ensure that the new step length is not too close to the endpoints of the interval. Practical
line search algorithms also make use of the properties of the interpolating polynomials to
make educated guesses of where the next step length should lie; see [39, 216]. A problem
that can arise is that as the optimization algorithm approaches the solution, two consecutive
function values $f(x_k)$ and $f(x_{k-1})$ may be indistinguishable in finite-precision arithmetic.
Therefore, the line search must include a stopping test if it cannot attain a lower function
value after a certain number (typically, ten) of trial step lengths. Some procedures also
stop if the relative change in $x$ is close to machine precision, or to some user-specified
threshold.

A line search algorithm that incorporates all these features is difficult to code. We
advocate the use of one of the several good software implementations available in the
public domain. See Dennis and Schnabel [92], Lemaréchal [189], Fletcher [101], Moré and
Thuente [216] (in particular), and Hager and Zhang [161].

One may ask how much more expensive it is to require the strong Wolfe conditions
instead of the regular Wolfe conditions. Our experience suggests that for a "loose" line
search (with parameters such as $c_1 = 10^{-4}$ and $c_2 = 0.9$), both strategies require a similar
amount of work. The strong Wolfe conditions have the advantage that by decreasing $c_2$ we
can directly control the quality of the search, by forcing the accepted value of $\alpha$ to lie closer
to a local minimum. This feature is important in steepest descent or nonlinear conjugate
gradient methods, and therefore a step selection routine that enforces the strong Wolfe
conditions has wide applicability.

NOTES  AND  REFERENCES 

For an extensive discussion of line search termination conditions see Ortega and
Rheinboldt [230]. Akaike [2] presents a probabilistic analysis of the steepest descent method
with exact line searches on quadratic functions. He shows that when $n > 2$, the worst-case
bound (3.29) can be expected to hold for most starting points. The case $n = 2$ can be
studied in closed form; see Bazaraa, Sherali, and Shetty [14]. Theorem 3.6 is due to Dennis
and Moré.

Some line search methods (see Goldfarb [132] and Moré and Sorensen [213]) compute
a direction of negative curvature, whenever it exists, to prevent the iteration from converging
to nonminimizing stationary points. A direction of negative curvature $p_-$ is one that satisfies
$p_-^T \nabla^2 f(x_k) p_- < 0$. These algorithms generate a search direction by combining $p_-$ with the
steepest descent direction $-\nabla f_k$, often performing a curvilinear backtracking line search.
It is difficult to determine the relative contributions of the steepest descent and negative
curvature directions. Because of this fact, the approach fell out of favor after the introduction
of trust-region methods.

For a more thorough treatment of the modified Cholesky factorization see Gill,
Murray, and Wright [130] or Dennis and Schnabel [92]. A modified Cholesky factorization
based on Gershgorin disk estimates is described in Schnabel and Eskow [276]. The modified
indefinite factorization is from Cheng and Higham [58].

Another strategy for implementing a line search Newton method when the Hessian
contains negative eigenvalues is to compute a direction of negative curvature and use it to
define the search direction (see Moré and Sorensen [213] and Goldfarb [132]).

Derivative-free  line  search  algorithms  include  golden  section  and  Fibonacci  search. 
They  share  some  of  the  features  with  the  line  search  method  given  in  this  chapter.  They 
typically  store  three  trial  points  that  determine  an  interval  containing  a  one-dimensional 
minimizer.  Golden  section  and  Fibonacci  differ  in  the  way  in  which  the  trial  step  lengths  are 
generated;  see,  for  example,  [79,  39]. 

Our  discussion  of  interpolation  follows  Dennis  and  Schnabel  [92],  and  the  algorithm 
for  finding  a  step  length  satisfying  the  strong  Wolfe  conditions  can  be  found  in  Fletcher 
[101]. 


EXERCISES

3.1 Program the steepest descent and Newton algorithms using the backtracking line
search, Algorithm 3.1. Use them to minimize the Rosenbrock function (2.22). Set the initial
step length $\alpha_0 = 1$ and print the step length used by each method at each iteration. First try
the initial point $x_0 = (1.2, 1.2)^T$ and then the more difficult starting point $x_0 = (-1.2, 1)^T$.

3.2 Show that if $0 < c_2 < c_1 < 1$, there may be no step lengths that satisfy the Wolfe
conditions.

3.3 Show that the one-dimensional minimizer of a strongly convex quadratic function
is given by (3.55).

3.4 Show that the one-dimensional minimizer of a strongly convex quadratic function
always satisfies the Goldstein conditions (3.11).

3.5 Prove that $\|Bx\| \ge \|x\| / \|B^{-1}\|$ for any nonsingular matrix $B$. Use this fact to
establish (3.19).

3.6 Consider the steepest descent method with exact line searches applied to the
convex quadratic function (3.24). Using the properties given in this chapter, show that if the
initial point is such that $x_0 - x^*$ is parallel to an eigenvector of $Q$, then the steepest descent
method will find the solution in one step.

3.7 Prove the result (3.28) by working through the following steps. First, use (3.26)
to show that

$$ \|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2 = 2 \alpha_k \nabla f_k^T Q (x_k - x^*) - \alpha_k^2 \nabla f_k^T Q \nabla f_k, $$

where $\|\cdot\|_Q$ is defined by (3.27). Second, use the fact that $\nabla f_k = Q(x_k - x^*)$ to obtain

$$ \|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2 = \frac{2 (\nabla f_k^T \nabla f_k)^2}{(\nabla f_k^T Q \nabla f_k)} - \frac{(\nabla f_k^T \nabla f_k)^2}{(\nabla f_k^T Q \nabla f_k)} $$

and

$$ \|x_k - x^*\|_Q^2 = \nabla f_k^T Q^{-1} \nabla f_k. $$

3.8 Let $Q$ be a positive definite symmetric matrix. Prove that for any vector $x$, we
have

$$ \frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \ge \frac{4 \lambda_n \lambda_1}{(\lambda_n + \lambda_1)^2}, $$

where $\lambda_n$ and $\lambda_1$ are, respectively, the largest and smallest eigenvalues of $Q$. (This relation,
which is known as the Kantorovich inequality, can be used to deduce (3.29) from (3.28).)

3.9 Program the BFGS algorithm using the line search algorithm described in this
chapter that implements the strong Wolfe conditions. Have the code verify that $y_k^T s_k$ is
always positive. Use it to minimize the Rosenbrock function using the starting points given
in Exercise 3.1.

3.10 Compute the eigenvalues of the 2 diagonal blocks of (3.52) and verify that each
block has a positive and a negative eigenvalue. Then compute the eigenvalues of $A$ and verify
that its inertia is the same as that of $B$.

3.11 Describe the effect that the modified Cholesky factorization (3.50) would have
on the Hessian $\nabla^2 f(x_k) = \mathrm{diag}(-2, 12, 4)$.

3.12 Consider a block diagonal matrix $B$ with $1 \times 1$ and $2 \times 2$ blocks. Show that the
eigenvalues and eigenvectors of $B$ can be obtained by computing the spectral decomposition
of each diagonal block separately.

3.13 Show that the quadratic function that interpolates $\phi(0)$, $\phi'(0)$, and $\phi(\alpha_0)$ is
given by (3.57). Then, make use of the fact that the sufficient decrease condition (3.6a) is
not satisfied at $\alpha_0$ to show that this quadratic has positive curvature and that the minimizer
satisfies

$$ \alpha_1 < \frac{\alpha_0}{2(1 - c_1)}. $$

Since $c_1$ is chosen to be quite small in practice, this inequality indicates that $\alpha_1$ cannot be
much greater than $\tfrac{1}{2}$ (and may be smaller), which gives us an idea of the new step length.

3.14 If $\phi(\alpha_0)$ is large, (3.58) shows that $\alpha_1$ can be quite small. Give an example of a
function and a step length $\alpha_0$ for which this situation arises. (Drastic changes to the estimate
of the step length are not desirable, since they indicate that the current interpolant does not
provide a good approximation to the function and that it should be modified before being
trusted to produce a good step length estimate. In practice, one imposes a lower bound,
typically $\rho = 0.1$, and defines the new step length as $\alpha_i = \max(\rho \alpha_{i-1}, \hat{\alpha}_i)$, where $\hat{\alpha}_i$ is the
minimizer of the interpolant.)

3.15 Suppose that the sufficient decrease condition (3.6a) is not satisfied at the step
lengths $\alpha_0$ and $\alpha_1$, and consider the cubic interpolating $\phi(0)$, $\phi'(0)$, $\phi(\alpha_0)$, and $\phi(\alpha_1)$.
By drawing graphs illustrating the two situations that can arise, show that the minimizer
of the cubic lies in $[0, \alpha_1]$. Then show that if $\phi(0) < \phi(\alpha_1)$, the minimizer is
less than $\tfrac{2}{3} \alpha_1$.


Chapter 4

Trust-Region Methods


Line  search  methods  and  trust-region  methods  both  generate  steps  with  the  help  of  a 
quadratic  model  of  the  objective  function,  but  they  use  this  model  in  different  ways.  Line 
search  methods  use  it  to  generate  a  search  direction,  and  then  focus  their  efforts  on  finding 
a  suitable  step  length  a  along  this  direction.  Trust-region  methods  define  a  region  around 
the  current  iterate  within  which  they  trust  the  model  to  be  an  adequate  representation  of 
the  objective  function,  and  then  choose  the  step  to  be  the  approximate  minimizer  of  the 
model in this region. In effect, they choose the direction and length of the step
simultaneously. If a step is not acceptable, they reduce the size of the region and find a new
minimizer. In general, the direction of the step changes whenever the size of the trust region
is altered.

The  size  of  the  trust  region  is  critical  to  the  effectiveness  of  each  step.  If  the  region  is 
too  small,  the  algorithm  misses  an  opportunity  to  take  a  substantial  step  that  will  move  it 
much  closer  to  the  minimizer  of  the  objective  function.  If  too  large,  the  minimizer  of  the 
model  may  be  far  from  the  minimizer  of  the  objective  function  in  the  region,  so  we  may  have 
to  reduce  the  size  of  the  region  and  try  again.  In  practical  algorithms,  we  choose  the  size  of 
the  region  according  to  the  performance  of  the  algorithm  during  previous  iterations.  If  the 
model  is  consistently  reliable,  producing  good  steps  and  accurately  predicting  the  behavior 
of  the  objective  function  along  these  steps,  the  size  of  the  trust  region  may  be  increased  to 
allow  longer,  more  ambitious,  steps  to  be  taken.  A  failed  step  is  an  indication  that  our  model 
is  an  inadequate  representation  of  the  objective  function  over  the  current  trust  region.  After 
such  a  step,  we  reduce  the  size  of  the  region  and  try  again. 

Figure 4.1 illustrates the trust-region approach on a function $f$ of two variables in
which the current point $x_k$ and the minimizer $x^*$ lie at opposite ends of a curved valley.
The quadratic model function $m_k$, whose elliptical contours are shown as dashed lines, is
constructed from function and derivative information at $x_k$ and possibly also from information
accumulated from previous iterations and steps. A line search method based on this model
searches along the step to the minimizer of $m_k$ (shown), but this direction will yield at most
a small reduction in $f$, even if the optimal steplength is used. The trust-region method
steps to the minimizer of $m_k$ within the dotted circle (shown), yielding a more significant
reduction in $f$ and better progress toward the solution.

Figure 4.1 Trust-region and line search steps.

In this chapter, we will assume that the model function $m_k$ that is used at each
iterate $x_k$ is quadratic. Moreover, $m_k$ is based on the Taylor-series expansion of $f$ around
$x_k$, which is

$$ f(x_k + p) = f_k + g_k^T p + \tfrac{1}{2} p^T \nabla^2 f(x_k + t p) p, \tag{4.1} $$

where $f_k = f(x_k)$ and $g_k = \nabla f(x_k)$, and $t$ is some scalar in the interval $(0, 1)$. By using an
approximation $B_k$ to the Hessian in the second-order term, $m_k$ is defined as follows:

$$ m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p, \tag{4.2} $$

where $B_k$ is some symmetric matrix. The difference between $m_k(p)$ and $f(x_k + p)$ is
$O(\|p\|^2)$, which is small when $p$ is small.

When $B_k$ is equal to the true Hessian $\nabla^2 f(x_k)$, the approximation error in the model
function $m_k$ is $O(\|p\|^3)$, so this model is especially accurate when $\|p\|$ is small. This choice
$B_k = \nabla^2 f(x_k)$ leads to the trust-region Newton method, and will be discussed further in
Section 4.4. In other sections of this chapter, we emphasize the generality of the trust-region
approach by assuming little about $B_k$ except symmetry and uniform boundedness.
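The two error orders can be observed numerically. The following is a minimal sketch, not part of the original text: the Rosenbrock test function, the point $x_k = (-1.2, 1)^T$, the direction `d`, and all function names are illustrative assumptions. It compares the model error for $B_k = I$ with that for $B_k = \nabla^2 f(x_k)$ as $\|p\|$ shrinks by factors of 10.

```python
import numpy as np

def f(x):
    # Rosenbrock function, used here only as a convenient smooth test problem
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

def hess(x):
    return np.array([
        [1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
        [-400.0 * x[0], 200.0],
    ])

def model(x, p, B):
    # m_k(p) = f_k + g_k^T p + 0.5 p^T B_k p, as in (4.2)
    return f(x) + grad(x) @ p + 0.5 * p @ B @ p

def model_errors(x, d, B, steps):
    # |f(x + t d) - m(t d)| for each step length t along the unit direction d
    return [abs(f(x + t * d) - model(x, t * d, B)) for t in steps]

x = np.array([-1.2, 1.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
steps = [1e-1, 1e-2, 1e-3]
err_identity = model_errors(x, d, np.eye(2), steps)   # generic symmetric B_k
err_hessian = model_errors(x, d, hess(x), steps)      # B_k = exact Hessian
for t, e1, e2 in zip(steps, err_identity, err_hessian):
    print(f"||p|| = {t:.0e}   B = I: {e1:.3e}   B = Hessian: {e2:.3e}")
```

Under these assumptions, each tenfold reduction of $\|p\|$ should shrink the first error by roughly a factor of 100 and the second by roughly a factor of 1000, in line with the $O(\|p\|^2)$ and $O(\|p\|^3)$ estimates.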

To obtain each step, we seek a solution of the subproblem

$$ \min_{p \in \mathbb{R}^n} \; m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \quad \text{s.t. } \|p\| \le \Delta_k, \tag{4.3} $$

where $\Delta_k > 0$ is the trust-region radius. In most of our discussions, we define $\|\cdot\|$ to be
the Euclidean norm, so that the solution $p_k^*$ of (4.3) is the minimizer of $m_k$ in the ball of
radius $\Delta_k$. Thus, the trust-region approach requires us to solve a sequence of subproblems
(4.3) in which the objective function and constraint (which can be written as $p^T p \le \Delta_k^2$)
are both quadratic. When $B_k$ is positive definite and $\|B_k^{-1} g_k\| \le \Delta_k$, the solution of (4.3) is
easy to identify: it is simply the unconstrained minimum $p_k^B = -B_k^{-1} g_k$ of the quadratic
$m_k(p)$. In this case, we call $p_k^B$ the full step. The solution of (4.3) is not so obvious in other
cases, but it can usually be found without too much computational expense. In any case,
as described below, we need only an approximate solution to obtain convergence and good
practical behavior.

OUTLINE  OF  THE  TRUST-REGION  APPROACH 

One of the key ingredients in a trust-region algorithm is the strategy for choosing the
trust-region radius $\Delta_k$ at each iteration. We base this choice on the agreement between the
model function $m_k$ and the objective function $f$ at previous iterations. Given a step $p_k$ we
define the ratio

$$ \rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}; \tag{4.4} $$

the numerator is called the actual reduction, and the denominator is the predicted reduction
(that is, the reduction in $f$ predicted by the model function). Note that since the step $p_k$
is obtained by minimizing the model $m_k$ over a region that includes $p = 0$, the predicted
reduction will always be nonnegative. Hence, if $\rho_k$ is negative, the new objective value
$f(x_k + p_k)$ is greater than the current value $f(x_k)$, so the step must be rejected. On the
other hand, if $\rho_k$ is close to 1, there is good agreement between the model $m_k$ and the
function $f$ over this step, so it is safe to expand the trust region for the next iteration. If $\rho_k$
is positive but significantly smaller than 1, we do not alter the trust region, but if it is close
to zero or negative, we shrink the trust region by reducing $\Delta_k$ at the next iteration.

The  following  algorithm  describes  the  process. 

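Algorithm 4.1 (Trust Region).

Given $\hat{\Delta} > 0$, $\Delta_0 \in (0, \hat{\Delta})$, and $\eta \in [0, \tfrac{1}{4})$:

for $k = 0, 1, 2, \ldots$

    Obtain $p_k$ by (approximately) solving (4.3);

    Evaluate $\rho_k$ from (4.4);

    if $\rho_k < \tfrac{1}{4}$

        $\Delta_{k+1} = \tfrac{1}{4} \Delta_k$

    else

        if $\rho_k > \tfrac{3}{4}$ and $\|p_k\| = \Delta_k$

            $\Delta_{k+1} = \min(2 \Delta_k, \hat{\Delta})$

        else

            $\Delta_{k+1} = \Delta_k$;

    if $\rho_k > \eta$

        $x_{k+1} = x_k + p_k$

    else

        $x_{k+1} = x_k$;

end (for).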

Here $\hat{\Delta}$ is an overall bound on the step lengths. Note that the radius is increased only if $\|p_k\|$
actually reaches the boundary of the trust region. If the step stays strictly inside the region,
we infer that the current value of $\Delta_k$ is not interfering with the progress of the algorithm,
so we leave its value unchanged for the next iteration.
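The loop of Algorithm 4.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the default parameter values, the gradient-norm stopping test, and the use of `np.isclose` to detect a boundary step are all choices made here, and the subproblem is solved only crudely, by the Cauchy point of Section 4.1.

```python
import numpy as np

def cauchy_step(g, B, delta):
    # Cauchy point from (4.11)-(4.12), used here as a simple subproblem solver
    gBg = g @ B @ g
    gnorm = np.linalg.norm(g)
    tau = 1.0 if gBg <= 0 else min(gnorm ** 3 / (delta * gBg), 1.0)
    return -tau * delta / gnorm * g

def trust_region(f, grad, hess, x0, delta_hat=2.0, delta0=1.0, eta=0.1,
                 solve=cauchy_step, tol=1e-6, max_iter=10000):
    x, delta = np.asarray(x0, dtype=float), delta0
    for k in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:          # stopping test (an added assumption)
            break
        p = solve(g, B, delta)
        predicted = -(g @ p + 0.5 * p @ B @ p)   # m_k(0) - m_k(p_k)
        actual = f(x) - f(x + p)                 # f(x_k) - f(x_k + p_k)
        rho = actual / predicted                 # ratio (4.4)
        if rho < 0.25:
            delta *= 0.25                        # poor agreement: shrink
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_hat)  # good agreement on the boundary
        if rho > eta:
            x = x + p                            # accept the step
    return x, k

# Usage on a small convex quadratic (hypothetical test data)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min, iters = trust_region(lambda x: 0.5 * x @ A @ x - b @ x,
                            lambda x: A @ x - b,
                            lambda x: A,
                            x0=[4.0, -3.0])
print(x_min, iters)
```

Passing a different `solve` callable (for example a dogleg routine) changes only the way the subproblem (4.3) is approximated; the radius update and acceptance test stay the same.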

To turn Algorithm 4.1 into a practical algorithm, we need to focus on solving the
trust-region subproblem (4.3). In discussing this matter, we sometimes drop the iteration
subscript $k$ and restate the problem (4.3) as follows:

$$ \min_{p \in \mathbb{R}^n} \; m(p) = f + g^T p + \tfrac{1}{2} p^T B p \quad \text{s.t. } \|p\| \le \Delta. \tag{4.5} $$

A first step to characterizing exact solutions of (4.5) is given by the following theorem
(due to Moré and Sorensen [214]), which shows that the solution $p^*$ of (4.5) satisfies

$$ (B + \lambda I) p^* = -g \tag{4.6} $$

for some $\lambda \ge 0$.

Theorem 4.1.

The vector $p^*$ is a global solution of the trust-region problem

$$ \min_{p \in \mathbb{R}^n} \; m(p) = f + g^T p + \tfrac{1}{2} p^T B p, \quad \text{s.t. } \|p\| \le \Delta, \tag{4.7} $$

if and only if $p^*$ is feasible and there is a scalar $\lambda \ge 0$ such that the following conditions are
satisfied:

$$ (B + \lambda I) p^* = -g, \tag{4.8a} $$

$$ \lambda (\Delta - \|p^*\|) = 0, \tag{4.8b} $$

$$ (B + \lambda I) \text{ is positive semidefinite.} \tag{4.8c} $$

We delay the proof of this result until Section 4.3, and instead discuss just its key
features here with the help of Figure 4.2. The condition (4.8b) is a complementarity condition
that states that at least one of the nonnegative quantities $\lambda$ and $(\Delta - \|p^*\|)$ must be zero.
Hence, when the solution lies strictly inside the trust region (as it does when $\Delta = \Delta_1$ in
Figure 4.2), we must have $\lambda = 0$ and so $B p^* = -g$ with $B$ positive semidefinite, from (4.8a)
and (4.8c), respectively. In the other cases $\Delta = \Delta_2$ and $\Delta = \Delta_3$, we have $\|p^*\| = \Delta$, and
so $\lambda$ is allowed to take a positive value. Note from (4.8a) that

$$ \lambda p^* = -B p^* - g = -\nabla m(p^*). $$

Thus, when $\lambda > 0$, the solution $p^*$ is collinear with the negative gradient of $m$ and normal
to its contours. These properties can be seen in Figure 4.2.

Figure 4.2 Solution of trust-region subproblem for different radii $\Delta_1$, $\Delta_2$, $\Delta_3$.
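The conditions (4.8) can be checked numerically on a small example. The sketch below assumes that $B$ is positive definite, so that $\|(B + \lambda I)^{-1} g\|$ decreases as $\lambda$ grows, and finds $\lambda$ by plain bisection. Bisection is an illustrative choice made here, not the iterative method of Section 4.3, and the function name and the data `B`, `g`, `delta` are hypothetical.

```python
import numpy as np

def exact_step_pd(g, B, delta, iters=200):
    # Assumes B is symmetric positive definite (an illustrative restriction).
    p = -np.linalg.solve(B, g)
    if np.linalg.norm(p) <= delta:
        return p, 0.0                          # interior solution: lambda = 0
    n = len(g)
    # At lambda = ||g|| / delta we have ||p(lambda)|| < delta, so [lo, hi] brackets the root
    lo, hi = 0.0, np.linalg.norm(g) / delta
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = -np.linalg.solve(B + lam * np.eye(n), g)
        if np.linalg.norm(p) > delta:
            lo = lam                           # step too long: increase lambda
        else:
            hi = lam                           # step feasible: decrease lambda
    return p, lam

B = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
delta = 0.3
p_star, lam = exact_step_pd(g, B, delta)

residual = (B + lam * np.eye(2)) @ p_star + g               # (4.8a): should vanish
complementarity = lam * (delta - np.linalg.norm(p_star))    # (4.8b): should vanish
min_eig = np.linalg.eigvalsh(B + lam * np.eye(2)).min()     # (4.8c): should be >= 0
print(lam, np.linalg.norm(residual), complementarity, min_eig)
```

In this example the full step $-B^{-1} g$ lies outside the region, so the solution sits on the boundary with $\lambda > 0$, which corresponds to the cases $\Delta = \Delta_2$ and $\Delta = \Delta_3$ discussed above.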

In Section 4.1, we describe two strategies for finding approximate solutions of the
subproblem (4.3), which achieve at least as much reduction in $m_k$ as the reduction achieved
by the so-called Cauchy point. This point is simply the minimizer of $m_k$ along the steepest
descent direction $-g_k$, subject to the trust-region bound. The first approximate strategy is
the dogleg method, which is appropriate when the model Hessian $B_k$ is positive definite. The
second strategy, known as two-dimensional subspace minimization, can be applied when $B_k$
is indefinite, though it requires an estimate of the most negative eigenvalue of this matrix.
A third strategy, described in Section 7.1, uses an approach based on the conjugate gradient
method to minimize $m_k$, and can therefore be applied when $B_k$ is large and sparse.

Section 4.3 is devoted to a strategy in which an iterative method is used to identify the
value of $\lambda$ for which (4.6) is satisfied by the solution of the subproblem. We prove global
convergence results in Section 4.2. Section 4.4 discusses the trust-region Newton method, in
which the Hessian $B_k$ of the model function is equal to the Hessian $\nabla^2 f(x_k)$ of the objective
function. The key result of this section is that, when the trust-region Newton algorithm
converges to a point $x^*$ satisfying second-order sufficient conditions, it converges superlinearly.


4.1  ALGORITHMS  BASED  ON  THE  CAUCHY  POINT 

THE  CAUCHY  POINT 

As we saw in Chapter 3, line search methods can be globally convergent even when the
optimal step length is not used at each iteration. In fact, the step length $\alpha_k$ need only satisfy
fairly loose criteria. A similar situation applies in trust-region methods. Although in principle
we seek the optimal solution of the subproblem (4.3), it is enough for purposes of global
convergence to find an approximate solution $p_k$ that lies within the trust region and gives a
sufficient reduction in the model. The sufficient reduction can be quantified in terms of the
Cauchy point, which we denote by $p_k^C$ and define in terms of the following simple procedure.

Algorithm 4.2 (Cauchy Point Calculation).

Find the vector $p_k^S$ that solves a linear version of (4.3), that is,

$$ p_k^S = \arg\min_{p \in \mathbb{R}^n} \; f_k + g_k^T p \quad \text{s.t. } \|p\| \le \Delta_k; \tag{4.9} $$

Calculate the scalar $\tau_k > 0$ that minimizes $m_k(\tau p_k^S)$ subject to
satisfying the trust-region bound, that is,

$$ \tau_k = \arg\min_{\tau > 0} \; m_k(\tau p_k^S) \quad \text{s.t. } \|\tau p_k^S\| \le \Delta_k; \tag{4.10} $$

Set $p_k^C = \tau_k p_k^S$.

It is easy to write down a closed-form definition of the Cauchy point. For a start, the
solution of (4.9) is simply

$$ p_k^S = -\frac{\Delta_k}{\|g_k\|} g_k. $$

To obtain $\tau_k$ explicitly, we consider the cases of $g_k^T B_k g_k \le 0$ and $g_k^T B_k g_k > 0$ separately. For
the former case, the function $m_k(\tau p_k^S)$ decreases monotonically with $\tau$ whenever $g_k \ne 0$,
so $\tau_k$ is simply the largest value that satisfies the trust-region bound, namely, $\tau_k = 1$. For
the case $g_k^T B_k g_k > 0$, $m_k(\tau p_k^S)$ is a convex quadratic in $\tau$, so $\tau_k$ is either the unconstrained
minimizer of this quadratic, $\|g_k\|^3 / (\Delta_k g_k^T B_k g_k)$, or the boundary value 1, whichever comes
first. In summary, we have

$$ p_k^C = -\tau_k \frac{\Delta_k}{\|g_k\|} g_k, \tag{4.11} $$

where

$$ \tau_k = \begin{cases} 1 & \text{if } g_k^T B_k g_k \le 0; \\ \min\left( \|g_k\|^3 / (\Delta_k g_k^T B_k g_k), \, 1 \right) & \text{otherwise.} \end{cases} \tag{4.12} $$


Figure 4.3 illustrates the Cauchy point for a subproblem in which $B_k$ is positive
definite. In this example, $p_k^C$ lies strictly inside the trust region.
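The closed form (4.11), (4.12) translates directly into code. The following is a minimal sketch: the function name `cauchy_point` and the handling of a zero gradient are assumptions made here for illustration.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point p^C for the model m(p) = f + g^T p + 0.5 p^T B p."""
    g = np.asarray(g, dtype=float)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)        # no descent direction (case not covered by (4.11))
    gBg = g @ B @ g
    if gBg <= 0.0:
        tau = 1.0                      # model decreases all the way to the boundary
    else:
        tau = min(gnorm ** 3 / (delta * gBg), 1.0)   # interior minimizer or boundary
    return -tau * (delta / gnorm) * g

# Positive curvature along -g: interior minimizer for a large radius,
# boundary point for a small radius (hypothetical test data).
g = np.array([1.0, 0.0])
B = np.diag([2.0, 1.0])
print(cauchy_point(g, B, delta=1.0))
print(cauchy_point(g, B, delta=0.25))
```

Only matrix-vector products with $B_k$ are needed, with no factorization.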

Figure 4.3 The Cauchy point.

The Cauchy step $p_k^C$ is inexpensive to calculate (no matrix factorizations are
required) and is of crucial importance in deciding if an approximate solution of the
trust-region subproblem is acceptable. Specifically, a trust-region method will be globally
convergent if its steps $p_k$ give a reduction in the model $m_k$ that is at least some fixed positive
multiple of the decrease attained by the Cauchy step.


IMPROVING  ON  THE  CAUCHY  POINT 

Since the Cauchy point $p_k^C$ provides sufficient reduction in the model function $m_k$
to yield global convergence, and since the cost of calculating it is so small, why should
we look any further for a better approximate solution of (4.3)? The reason is that by
always taking the Cauchy point as our step, we are simply implementing the steepest
descent method with a particular choice of step length. As we have seen in Chapter 3,
steepest descent performs poorly even if an optimal step length is used at each
iteration.

The Cauchy point does not depend very strongly on the matrix $B_k$, which is used only
in the calculation of the step length. Rapid convergence can be expected only if $B_k$ plays a
role in determining the direction of the step as well as its length, and if $B_k$ contains valid
curvature information about the function.

A number of trust-region algorithms compute the Cauchy point and then try to
improve on it. The improvement strategy is often designed so that the full step $p_k^B = -B_k^{-1} g_k$
is chosen whenever $B_k$ is positive definite and $\|p_k^B\| \le \Delta_k$. When $B_k$ is the exact Hessian
$\nabla^2 f(x_k)$ or a quasi-Newton approximation, this strategy can be expected to yield superlinear
convergence.

We now consider two methods for finding approximate solutions to (4.3) that have
the features just described. Throughout this section we will be focusing on the internal
workings of a single iteration, so we simplify the notation by dropping the subscript "k"
from the quantities $\Delta_k$, $p_k$, $m_k$, and $g_k$ and refer to the formulation (4.5) of the subproblem.
In this section, we denote the solution of (4.5) by $p^*(\Delta)$, to emphasize the dependence
on $\Delta$.


THE  DOGLEG  METHOD 

The  first  approach  we  discuss  goes  by  the  descriptive  title  of  the  dogleg  method.  It  can 
be  used  when  B  is  positive  definite. 

To motivate this method, we start by examining the effect of the trust-region radius $\Delta$
on the solution $p^*(\Delta)$ of the subproblem (4.5). When $B$ is positive definite, we have already
noted that the unconstrained minimizer of $m$ is $p^B = -B^{-1} g$. When this point is feasible
for (4.5), it is obviously a solution, so we have

$$ p^*(\Delta) = p^B, \quad \text{when } \Delta \ge \|p^B\|. \tag{4.13} $$

When $\Delta$ is small relative to $\|p^B\|$, the restriction $\|p\| \le \Delta$ ensures that the quadratic term in
$m$ has little effect on the solution of (4.5). For such $\Delta$, we can get an approximation to $p^*(\Delta)$
by simply omitting the quadratic term from (4.5) and writing

$$ p^*(\Delta) \approx -\Delta \frac{g}{\|g\|}, \quad \text{when } \Delta \text{ is small.} \tag{4.14} $$

For intermediate values of $\Delta$, the solution $p^*(\Delta)$ typically follows a curved trajectory like
the one in Figure 4.4.

The dogleg method finds an approximate solution by replacing the curved trajectory
for $p^*(\Delta)$ with a path consisting of two line segments. The first line segment runs from the
origin to the minimizer of $m$ along the steepest descent direction, which is

$$ p^U = -\frac{g^T g}{g^T B g} g, \tag{4.15} $$

while the second line segment runs from $p^U$ to $p^B$ (see Figure 4.4). Formally, we denote this
trajectory by $\tilde{p}(\tau)$ for $\tau \in [0, 2]$, where

$$ \tilde{p}(\tau) = \begin{cases} \tau p^U, & 0 \le \tau \le 1, \\ p^U + (\tau - 1)(p^B - p^U), & 1 \le \tau \le 2. \end{cases} \tag{4.16} $$


The dogleg method chooses $p$ to minimize the model $m$ along this path, subject to
the trust-region bound. The following lemma shows that the minimum along the dogleg
path can be found easily.

Lemma 4.2.

Let $B$ be positive definite. Then

(i) $\|\tilde{p}(\tau)\|$ is an increasing function of $\tau$, and

(ii) $m(\tilde{p}(\tau))$ is a decreasing function of $\tau$.

Proof. It is easy to show that (i) and (ii) both hold for $\tau \in [0, 1]$, so we restrict our
attention to the case of $\tau \in [1, 2]$. For (i), define $h(\alpha)$ by

$$ \begin{aligned} h(\alpha) &= \tfrac{1}{2} \|\tilde{p}(1 + \alpha)\|^2 \\ &= \tfrac{1}{2} \|p^U + \alpha (p^B - p^U)\|^2 \\ &= \tfrac{1}{2} \|p^U\|^2 + \alpha (p^U)^T (p^B - p^U) + \tfrac{1}{2} \alpha^2 \|p^B - p^U\|^2. \end{aligned} $$

Our result is proved if we can show that $h'(\alpha) \ge 0$ for $\alpha \in (0, 1)$. Now,

$$ \begin{aligned} h'(\alpha) &= -(p^U)^T (p^U - p^B) + \alpha \|p^U - p^B\|^2 \\ &\ge -(p^U)^T (p^U - p^B) \\ &= \frac{g^T g}{g^T B g} \, g^T \left( -\frac{g^T g}{g^T B g} g + B^{-1} g \right) \\ &= g^T g \, \frac{g^T B^{-1} g}{g^T B g} \left[ 1 - \frac{(g^T g)^2}{(g^T B g)(g^T B^{-1} g)} \right] \\ &\ge 0, \end{aligned} $$

where the final inequality is a consequence of the Cauchy-Schwarz inequality. (We leave the
details as an exercise.)

For (ii), we define $\hat{h}(\alpha) = m(\tilde{p}(1 + \alpha))$ and show that $\hat{h}'(\alpha) \le 0$ for $\alpha \in (0, 1)$.
Substitution of (4.16) into (4.5) and differentiation with respect to the argument leads to

$$ \begin{aligned} \hat{h}'(\alpha) &= (p^B - p^U)^T (g + B p^U) + \alpha (p^B - p^U)^T B (p^B - p^U) \\ &\le (p^B - p^U)^T \left( g + B p^U + B (p^B - p^U) \right) \\ &= (p^B - p^U)^T (g + B p^B) = 0, \end{aligned} $$

giving the result. □

It follows from this lemma that the path $\tilde{p}(\tau)$ intersects the trust-region boundary
$\|p\| = \Delta$ at exactly one point if $\|p^B\| \ge \Delta$, and nowhere otherwise. Since $m$ is decreasing
along the path, the chosen value of $p$ will be at $p^B$ if $\|p^B\| \le \Delta$, otherwise at the point of
intersection of the dogleg and the trust-region boundary. In the latter case, we compute the
appropriate value of $\tau$ by solving the following scalar quadratic equation:

$$ \|p^U + (\tau - 1)(p^B - p^U)\|^2 = \Delta^2. $$

Consider now the case in which the exact Hessian $\nabla^2 f(x_k)$ is available for use in the
model problem (4.5). When $\nabla^2 f(x_k)$ is positive definite, we can simply set $B = \nabla^2 f(x_k)$
(that is, $p^B = -(\nabla^2 f(x_k))^{-1} g_k$) and apply the procedure above to find the Newton-dogleg
step. Otherwise, we can define $p^B$ by choosing $B$ to be one of the positive definite modified
Hessians described in Section 3.4, then proceed as above to find the dogleg step. Near
a solution satisfying second-order sufficient conditions (see Theorem 2.4), $p^B$ will be set
to the usual Newton step, allowing the possibility of rapid local convergence of Newton's
method (see Section 4.4).
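The dogleg procedure described above can be sketched as follows. The function name `dogleg_step` and the variable names are assumptions made for illustration, and the routine assumes that $B$ is symmetric positive definite, as the method requires. When the boundary is reached on the first segment ($\tau \le 1$), the step is a scaled multiple of $p^U$; otherwise the scalar quadratic equation is solved for $s = \tau - 1 \in [0, 1]$.

```python
import numpy as np

def dogleg_step(g, B, delta):
    # Assumes B is symmetric positive definite.
    pB = -np.linalg.solve(B, g)                 # full step p^B = -B^{-1} g
    if np.linalg.norm(pB) <= delta:
        return pB                               # end of the path is feasible
    pU = -(g @ g) / (g @ B @ g) * g             # steepest descent minimizer (4.15)
    norm_pU = np.linalg.norm(pU)
    if norm_pU >= delta:
        return (delta / norm_pU) * pU           # boundary reached on the first segment
    # Solve ||pU + s (pB - pU)||^2 = delta^2 for the root s in [0, 1]
    d = pB - pU
    a = d @ d
    b = 2.0 * (pU @ d)
    c = pU @ pU - delta ** 2                    # negative, since ||pU|| < delta
    s = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return pU + s * d

# Hypothetical data: the three branches for three different radii
B = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
for delta in (0.3, 0.6, 1.0):
    p = dogleg_step(g, B, delta)
    print(delta, p, np.linalg.norm(p))
```

Because $c < 0$ in the last branch, the quadratic has exactly one positive root, which matches the single intersection point guaranteed by Lemma 4.2.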

The use of a modified Hessian in the Newton-dogleg method is not completely
satisfying from an intuitive viewpoint, however. A modified factorization perturbs the
diagonals of $\nabla^2 f(x_k)$ in a somewhat arbitrary manner, and the benefits of the trust-region
approach may not be realized. In fact, the modification introduced during the factorization
of the Hessian is redundant in some sense because the trust-region strategy introduces its
own modification. As we show in Section 4.3, the exact solution of the trust-region problem
(4.3) with $B_k = \nabla^2 f(x_k)$ is $-(\nabla^2 f(x_k) + \lambda I)^{-1} g_k$, where $\lambda$ is chosen large enough to make
$(\nabla^2 f(x_k) + \lambda I)$ positive definite, and its value depends on the trust-region radius $\Delta_k$. We
conclude that the Newton-dogleg method is most appropriate when the objective function
is convex (that is, $\nabla^2 f(x_k)$ is always positive semidefinite). The techniques described below
may be more suitable for the general case.

The dogleg strategy can be adapted to handle indefinite matrices $B$, but there is not
much point in doing so because the full step $p^B$ is not the unconstrained minimizer of $m$
in this case. Instead, we now describe another strategy, which aims to include directions of
negative curvature (that is, directions $d$ for which $d^T B d < 0$) in the space of candidate
trust-region steps.


TWO-DIMENSIONAL  SUBSPACE  MINIMIZATION 

When $B$ is positive definite, the dogleg method strategy can be made slightly more
sophisticated by widening the search for $p$ to the entire two-dimensional subspace spanned
by $p^U$ and $p^B$ (equivalently, $g$ and $-B^{-1} g$). The subproblem (4.5) is replaced by

$$ \min_{p} \; m(p) = f + g^T p + \tfrac{1}{2} p^T B p \quad \text{s.t. } \|p\| \le \Delta, \; p \in \mathrm{span}[g, B^{-1} g]. \tag{4.17} $$

This is a problem in two variables that is computationally inexpensive to solve. (After some
algebraic manipulation it can be reduced to finding the roots of a fourth degree polynomial.)
Clearly, the Cauchy point $p^C$ is feasible for (4.17), so the optimal solution of this subproblem
yields at least as much reduction in $m$ as the Cauchy point, resulting in global convergence
of the algorithm. The two-dimensional subspace minimization strategy is obviously an
extension of the dogleg method as well, since the entire dogleg path lies in $\mathrm{span}[g, B^{-1} g]$.

This strategy can be modified to handle the case of indefinite $B$ in a way that is intuitive,
practical, and theoretically sound. We mention just the salient points of the handling of the
indefiniteness here, and refer the reader to papers by Byrd, Schnabel, and Schultz (see [54]
and [279]) for details. When $B$ has negative eigenvalues, the two-dimensional subspace in
(4.17) is changed to

$$ \mathrm{span}[g, (B + \alpha I)^{-1} g], \quad \text{for some } \alpha \in (-\lambda_1, -2 \lambda_1], \tag{4.18} $$

where $\lambda_1$ denotes the most negative eigenvalue of $B$. (This choice of $\alpha$ ensures that $B + \alpha I$ is
positive definite, and the flexibility in the choice of $\alpha$ allows us to use a numerical procedure
such as the Lanczos method to compute it.) When $\|(B + \alpha I)^{-1} g\| \le \Delta$, we discard the
subspace search of (4.17), (4.18) and instead define the step to be

$$ p = -(B + \alpha I)^{-1} g + v, \tag{4.19} $$

where $v$ is a vector that satisfies $v^T (B + \alpha I)^{-1} g \le 0$. (This condition ensures that $\|p\| \ge
\|(B + \alpha I)^{-1} g\|$.) When $B$ has zero eigenvalues but no negative eigenvalues, we define the
step to be the Cauchy point $p = p^C$.

When the exact Hessian is available, we can set $B = \nabla^2 f(x_k)$, and note that $-B^{-1} g$ is
the Newton step. Hence, when the Hessian is positive definite at the solution $x^*$ and when
$x_k$ is close to $x^*$ and $\Delta$ is sufficiently large, the subspace minimization problem (4.17) will
be solved by the Newton step.

The reduction in model function $m$ achieved by the two-dimensional subspace
minimization strategy often is close to the reduction achieved by the exact solution of (4.5).
Most of the computational effort lies in a single factorization of $B$ or $B + \alpha I$ (estimation of
$\alpha$ and solution of (4.17) are less significant), while strategies that find nearly exact solutions
of (4.5) typically require two or three such factorizations (see Section 4.3).
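For positive definite $B$, the subspace problem (4.17) can be illustrated with a crude numerical sketch. Instead of the fourth degree polynomial mentioned above, the code below reduces the model to two variables through an orthonormal basis and then simply samples the boundary of the two-dimensional region on a fine grid of angles. This sampling shortcut, the function name `subspace_step`, and the parameter `n_angles` are assumptions made here for illustration only; the indefinite case (4.18), (4.19) is not handled.

```python
import numpy as np

def subspace_step(g, B, delta, n_angles=20000):
    # Assumes B is symmetric positive definite.
    pB = -np.linalg.solve(B, g)
    if np.linalg.norm(pB) <= delta:
        return pB                                   # full step is feasible
    # Orthonormal basis Q (n x 2) whose columns span [g, B^{-1} g]
    Q, _ = np.linalg.qr(np.column_stack([g, pB]))
    g2, B2 = Q.T @ g, Q.T @ B @ Q                   # reduced two-variable model
    # pB lies in the subspace but outside the region, so by convexity the
    # constrained minimizer is on the boundary ||p|| = delta.
    theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    U = delta * np.vstack([np.cos(theta), np.sin(theta)])     # 2 x n_angles
    values = g2 @ U + 0.5 * np.einsum("ij,ij->j", U, B2 @ U)  # model values
    return Q @ U[:, np.argmin(values)]

# Hypothetical three-variable example
B = np.diag([1.0, 5.0, 20.0])
g = np.ones(3)
p = subspace_step(g, B, delta=0.5)
print(p, np.linalg.norm(p))
```

The result is accurate only up to the angular grid spacing; a production code would solve the two-variable problem exactly, but the sketch is enough to compare the reduction in $m$ with that of the Cauchy point or the dogleg step.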


4.2  GLOBAL  CONVERGENCE 

REDUCTION  OBTAINED  BY  THE  CAUCHY  POINT 

In the preceding discussion of algorithms for approximately solving the trust-region
subproblem, we have repeatedly emphasized that global convergence depends on the
approximate solution obtaining at least as much decrease in the model function $m$ as the
Cauchy point. (In fact, a fixed positive fraction of the Cauchy decrease suffices.) We start
the global convergence analysis by obtaining an estimate of the decrease in $m$ achieved by
the Cauchy point. We then use this estimate to prove that the sequence of gradients $\{g_k\}$
generated by Algorithm 4.1 has an accumulation point at zero, and in fact converges to zero
when $\eta$ is strictly positive.

Our first main result is that the dogleg and two-dimensional subspace minimization
algorithms and Steihaug's algorithm (Algorithm 7.2) produce approximate solutions $p_k$ of
the subproblem (4.3) that satisfy the following estimate of decrease in the model function:

$$m_k(0) - m_k(p_k) \ge c_1 \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right), \tag{4.20}$$

for some constant $c_1 \in (0, 1]$. The usefulness of this estimate will become clear in the
following two sections. For now, we note that when $\Delta_k$ is the minimum value in (4.20), the
condition is slightly reminiscent of the first Wolfe condition: The desired reduction in the
model is proportional to the gradient and the size of the step.

We show now that the Cauchy point $p_k^C$ satisfies (4.20), with $c_1 = \frac{1}{2}$.

Lemma  4.3. 

The Cauchy point $p_k^C$ satisfies (4.20) with $c_1 = \frac{1}{2}$, that is,

$$m_k(0) - m_k(p_k^C) \ge \frac{1}{2} \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right). \tag{4.21}$$

PROOF.  For  simplicity,  we  drop  the  iteration  index  k  in  the  proof. 

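We consider first the case $g^T B g \le 0$. Here, we have

$$\begin{aligned}
m(p^C) - m(0) &= m(-\Delta g / \|g\|) - f \\
&= -\frac{\Delta}{\|g\|} \|g\|^2 + \frac{1}{2} \frac{\Delta^2}{\|g\|^2} g^T B g \\
&\le -\Delta \|g\| \\
&\le -\|g\| \min\left(\Delta, \frac{\|g\|}{\|B\|}\right),
\end{aligned}$$

and so (4.21) certainly holds.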

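For the next case, consider $g^T B g > 0$ and

$$\frac{\|g\|^3}{\Delta\, g^T B g} \le 1. \tag{4.22}$$

From (4.12), we have $\tau = \|g\|^3 / (\Delta\, g^T B g)$, and so from (4.11) it follows that

$$\begin{aligned}
m(p^C) - m(0) &= -\frac{\|g\|^4}{g^T B g} + \frac{1}{2}\, g^T B g\, \frac{\|g\|^4}{(g^T B g)^2} \\
&= -\frac{1}{2} \frac{\|g\|^4}{g^T B g} \\
&\le -\frac{1}{2} \frac{\|g\|^4}{\|B\| \|g\|^2} \\
&= -\frac{1}{2} \frac{\|g\|^2}{\|B\|} \\
&\le -\frac{1}{2} \|g\| \min\left(\Delta, \frac{\|g\|}{\|B\|}\right),
\end{aligned}$$

so (4.21) holds here too.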

In the remaining case, (4.22) does not hold, and therefore

$$g^T B g < \frac{\|g\|^3}{\Delta}. \tag{4.23}$$

From (4.12), we have $\tau = 1$, and using this fact together with (4.23), we obtain

$$\begin{aligned}
m(p^C) - m(0) &= -\frac{\Delta}{\|g\|} \|g\|^2 + \frac{1}{2} \frac{\Delta^2}{\|g\|^2} g^T B g \\
&\le -\Delta \|g\| + \frac{1}{2} \frac{\Delta^2}{\|g\|^2} \frac{\|g\|^3}{\Delta} \\
&= -\frac{1}{2} \Delta \|g\| \\
&\le -\frac{1}{2} \|g\| \min\left(\Delta, \frac{\|g\|}{\|B\|}\right),
\end{aligned}$$

yielding the desired result (4.21) once again.


□ 
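The three cases of the proof can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a symmetric, nonzero $2 \times 2$ matrix $B$ (so that $\|B\|$ has a closed form), and the helper names `cauchy_point`, `model_decrease`, and `cauchy_bound` are hypothetical. It computes the Cauchy point from the step length $\tau$ described in the proof and compares the model decrease with the right-hand side of (4.21).

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def matvec(B, v):
    return [dot(row, v) for row in B]

def spectral_norm_sym2(B):
    """Spectral norm of a symmetric 2x2 matrix from its closed-form eigenvalues."""
    a, b, d = B[0][0], B[0][1], B[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return max(abs(mean + radius), abs(mean - radius))

def cauchy_point(g, B, delta):
    """p^C = -tau * (delta / ||g||) * g, with tau = 1 if g^T B g <= 0,
    and tau = min(||g||^3 / (delta * g^T B g), 1) otherwise."""
    gnorm = norm(g)
    gBg = dot(g, matvec(B, g))
    tau = 1.0 if gBg <= 0.0 else min(gnorm ** 3 / (delta * gBg), 1.0)
    return [-tau * delta / gnorm * gi for gi in g]

def model_decrease(g, B, p):
    """m(0) - m(p) for m(p) = f + g^T p + 0.5 p^T B p (the constant f cancels)."""
    return -(dot(g, p) + 0.5 * dot(p, matvec(B, p)))

def cauchy_bound(g, B, delta):
    """Right-hand side of (4.21); assumes B is not the zero matrix."""
    gnorm = norm(g)
    return 0.5 * gnorm * min(delta, gnorm / spectral_norm_sym2(B))

cases = [
    ([1.0, 1.0], [[-1.0, 0.0], [0.0, -2.0]], 1.0),  # g^T B g <= 0
    ([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], 10.0),   # (4.22) holds, tau < 1
    ([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], 0.1),    # (4.23) holds, tau = 1
]
for g, B, delta in cases:
    p = cauchy_point(g, B, delta)
    print(model_decrease(g, B, p), ">=", cauchy_bound(g, B, delta))
```

In the second case the two printed numbers coincide, which suggests that the constant $\frac{1}{2}$ in (4.21) cannot be improved in general.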


To satisfy (4.20), our approximate solution $p_k$ has only to achieve a reduction that is
at least some fixed fraction $c_2$ of the reduction achieved by the Cauchy point. We state the
observation formally as a theorem.

Theorem 4.4.

Let $p_k$ be any vector such that $\|p_k\| \le \Delta_k$ and $m_k(0) - m_k(p_k) \ge c_2 \left(m_k(0) - m_k(p_k^C)\right)$.
Then $p_k$ satisfies (4.20) with $c_1 = c_2/2$. In particular, if $p_k$ is the exact solution $p_k^*$ of (4.3),
then it satisfies (4.20) with $c_1 = \frac{1}{2}$.

Proof. Since $\|p_k\| \le \Delta_k$, we have from Lemma 4.3 that

$$m_k(0) - m_k(p_k) \ge c_2 \left(m_k(0) - m_k(p_k^C)\right) \ge \frac{1}{2} c_2 \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right),$$

giving the result. □

Note that the dogleg and two-dimensional subspace minimization algorithms both
satisfy (4.20) with $c_1 = \frac{1}{2}$, because they both produce approximate solutions $p_k$ for which
$m_k(p_k) \le m_k(p_k^C)$.

CONVERGENCE  TO  STATIONARY  POINTS 

Global convergence results for trust-region methods come in two varieties, depending
on whether we set the parameter $\eta$ in Algorithm 4.1 to zero or to some small positive value.
When $\eta = 0$ (that is, the step is taken whenever it produces a lower value of $f$), we can
show that the sequence of gradients $\{g_k\}$ has a limit point at zero. For the more stringent
acceptance test with $\eta > 0$, which requires the actual decrease in $f$ to be at least some small
fraction of the predicted decrease, we have the stronger result that $g_k \to 0$.
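To make the role of $\eta$ concrete, here is a minimal sketch of a trust-region loop that uses Cauchy-point steps. It is an illustration under stated assumptions rather than a transcription of Algorithm 4.1: the test function, the diagonal Hessian, the enlargement rule (double the radius when $\rho_k > \frac{3}{4}$ and the step reaches the boundary), the cap `delta_max`, and all function names are choices made here. Only the reduction of the radius by a factor $\frac{1}{4}$ when $\rho_k < \frac{1}{4}$ and the acceptance test $\rho_k > \eta$ are taken from the surrounding discussion.

```python
import math

def f(x):
    # Hypothetical smooth test function with minimizer (0, 0).
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] ** 4

def grad(x):
    return [2.0 * x[0] + 4.0 * x[0] ** 3, 4.0 * x[1]]

def hess_diag(x):
    """Diagonal of the exact Hessian of f (which is diagonal for this f)."""
    return [2.0 + 12.0 * x[0] ** 2, 4.0]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def cauchy_step(g, b, delta):
    """Cauchy point for the model with Hessian diag(b)."""
    gnorm = norm(g)
    gBg = sum(bi * gi * gi for bi, gi in zip(b, g))
    tau = 1.0 if gBg <= 0.0 else min(gnorm ** 3 / (delta * gBg), 1.0)
    return [-tau * delta / gnorm * gi for gi in g]

def trust_region(x, delta=1.0, delta_max=10.0, eta=0.1, tol=1e-6, max_iter=500):
    iterations = 0
    for _ in range(max_iter):
        g = grad(x)
        if norm(g) <= tol:
            break
        b = hess_diag(x)
        p = cauchy_step(g, b, delta)
        # Predicted reduction m_k(0) - m_k(p_k); positive whenever g != 0.
        predicted = -(sum(gi * pi for gi, pi in zip(g, p))
                      + 0.5 * sum(bi * pi * pi for bi, pi in zip(b, p)))
        trial = [xi + pi for xi, pi in zip(x, p)]
        rho = (f(x) - f(trial)) / predicted
        if rho < 0.25:
            delta *= 0.25                      # shrink the region
        elif rho > 0.75 and norm(p) >= (1.0 - 1e-12) * delta:
            delta = min(2.0 * delta, delta_max)  # assumed enlargement rule
        if rho > eta:                          # acceptance test
            x = trial
        iterations += 1
    return x, norm(grad(x)), iterations

x_final, gnorm, iters = trust_region([1.5, 1.0])
print(x_final, gnorm, iters)
```

Setting `eta=0.0` accepts every step that lowers $f$, which corresponds to the first of the two cases discussed above.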

In this section we prove the global convergence results for both cases. We assume
throughout that the approximate Hessians $B_k$ are uniformly bounded in norm, and that $f$
is bounded below on the level set

$$S = \{x \mid f(x) \le f(x_0)\}. \tag{4.24}$$

For later reference, we define an open neighborhood of this set by

$$S(R_0) = \{x \mid \|x - y\| < R_0 \text{ for some } y \in S\},$$

where $R_0$ is a positive constant.

To allow our results to be applied more generally, we also allow the length of the
approximate solution $p_k$ of (4.3) to exceed the trust-region bound, provided that it stays
within some fixed multiple of the bound; that is,

$$\|p_k\| \le \gamma \Delta_k, \quad \text{for some constant } \gamma \ge 1. \tag{4.25}$$

The first result deals with the case $\eta = 0$.

Theorem  4.5. 

Let $\eta = 0$ in Algorithm 4.1. Suppose that $\|B_k\| \le \beta$ for some constant $\beta$, that $f$ is
bounded below on the level set $S$ defined by (4.24) and Lipschitz continuously differentiable in
the neighborhood $S(R_0)$ for some $R_0 > 0$, and that all approximate solutions of (4.3) satisfy
the inequalities (4.20) and (4.25), for some positive constants $c_1$ and $\gamma$. We then have

$$\liminf_{k \to \infty} \|g_k\| = 0. \tag{4.26}$$

PROOF. By performing some technical manipulation with the ratio $\rho_k$ from (4.4), we obtain

$$|\rho_k - 1| = \left| \frac{\left(f(x_k) - f(x_k + p_k)\right) - \left(m_k(0) - m_k(p_k)\right)}{m_k(0) - m_k(p_k)} \right| = \left| \frac{m_k(p_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} \right|.$$

Since from Taylor's theorem (Theorem 2.1), in its integral form, we have that

$$f(x_k + p_k) = f(x_k) + g(x_k)^T p_k + \int_0^1 \left[g(x_k + t p_k) - g(x_k)\right]^T p_k \, dt,$$

it follows from the definition (4.2) of $m_k$ that

$$\left| m_k(p_k) - f(x_k + p_k) \right| = \left| \frac{1}{2} p_k^T B_k p_k - \int_0^1 \left[g(x_k + t p_k) - g(x_k)\right]^T p_k \, dt \right| \le (\beta/2) \|p_k\|^2 + \beta_1 \|p_k\|^2, \tag{4.27}$$

where we have used $\beta_1$ to denote the Lipschitz constant for $g$ on the set $S(R_0)$, and assumed
that $\|p_k\| \le R_0$ to ensure that $x_k$ and $x_k + t p_k$ both lie in the set $S(R_0)$.

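Suppose for contradiction that there is $\epsilon > 0$ and a positive index $K$ such that

$$\|g_k\| \ge \epsilon, \quad \text{for all } k \ge K. \tag{4.28}$$

From (4.20), we have for $k \ge K$ that

$$m_k(0) - m_k(p_k) \ge c_1 \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right) \ge c_1 \epsilon \min\left(\Delta_k, \frac{\epsilon}{\beta}\right). \tag{4.29}$$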


Using (4.29), (4.27), and the bound (4.25), we have

$$|\rho_k - 1| \le \frac{\gamma^2 \Delta_k^2 (\beta/2 + \beta_1)}{c_1 \epsilon \min(\Delta_k, \epsilon/\beta)}. \tag{4.30}$$


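We now derive a bound on the right-hand side that holds for all sufficiently small values of
$\Delta_k$, that is, for all $\Delta_k \le \bar{\Delta}$, where $\bar{\Delta}$ is defined as follows:

$$\bar{\Delta} = \min\left( \frac{1}{2} \frac{c_1 \epsilon}{\gamma^2 (\beta/2 + \beta_1)},\ \frac{R_0}{\gamma} \right). \tag{4.31}$$

The $R_0/\gamma$ term in this definition ensures that the bound (4.27) is valid (because
$\|p_k\| \le \gamma \Delta_k \le \gamma \bar{\Delta} \le R_0$). Note that since $c_1 \le 1$ and $\gamma \ge 1$, we have $\bar{\Delta} \le \epsilon/\beta$. The latter
condition implies that for all $\Delta_k \in [0, \bar{\Delta}]$, we have $\min(\Delta_k, \epsilon/\beta) = \Delta_k$, so from (4.30) and
(4.31), we have

$$|\rho_k - 1| \le \frac{\gamma^2 \Delta_k^2 (\beta/2 + \beta_1)}{c_1 \epsilon \Delta_k} = \frac{\gamma^2 \Delta_k (\beta/2 + \beta_1)}{c_1 \epsilon} \le \frac{\gamma^2 \bar{\Delta} (\beta/2 + \beta_1)}{c_1 \epsilon} \le \frac{1}{2}.$$

Therefore, $\rho_k > \frac{1}{4}$, and so by the workings of Algorithm 4.1, we have $\Delta_{k+1} \ge \Delta_k$ whenever
$\Delta_k$ falls below the threshold $\bar{\Delta}$. It follows that reduction of $\Delta_k$ (by a factor of $\frac{1}{4}$) can occur
in our algorithm only if

$$\Delta_k \ge \bar{\Delta},$$

and therefore we conclude that

$$\Delta_k \ge \min\left(\Delta_K, \bar{\Delta}/4\right) \quad \text{for all } k \ge K. \tag{4.32}$$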

Suppose now that there is an infinite subsequence $\mathcal{K}$ such that $\rho_k \ge \frac{1}{4}$ for $k \in \mathcal{K}$. For
$k \in \mathcal{K}$ and $k \ge K$, we have from (4.29) that

$$\begin{aligned}
f(x_k) - f(x_{k+1}) &= f(x_k) - f(x_k + p_k) \\
&\ge \frac{1}{4} \left[m_k(0) - m_k(p_k)\right] \\
&\ge \frac{1}{4} c_1 \epsilon \min(\Delta_k, \epsilon/\beta).
\end{aligned}$$

Since $f$ is bounded below, it follows from this inequality that

$$\lim_{k \in \mathcal{K},\ k \to \infty} \Delta_k = 0,$$

contradicting (4.32). Hence no such infinite subsequence $\mathcal{K}$ can exist, and we must have
$\rho_k < \frac{1}{4}$ for all $k$ sufficiently large. In this case, $\Delta_k$ will eventually be multiplied by $\frac{1}{4}$ at every
iteration, and we have $\lim_{k \to \infty} \Delta_k = 0$, which again contradicts (4.32). Hence, our original
assertion (4.28) must be false, giving (4.26). □

Our second global convergence result, for the case $\eta > 0$, borrows much of the analysis
from the proof above. Our approach here follows that of Schultz, Schnabel, and Byrd [279].

Theorem  4.6. 

Let $\eta \in \left(0, \frac{1}{4}\right)$ in Algorithm 4.1. Suppose that $\|B_k\| \le \beta$ for some constant $\beta$, that $f$ is
bounded below on the level set $S$ (4.24) and Lipschitz continuously differentiable in $S(R_0)$ for
some $R_0 > 0$, and that all approximate solutions $p_k$ of (4.3) satisfy the inequalities (4.20) and
(4.25) for some positive constants $c_1$ and $\gamma$. We then have

$$\lim_{k \to \infty} g_k = 0. \tag{4.33}$$

PROOF. We consider a particular positive index $m$ with $g_m \ne 0$. Using $\beta_1$ again to denote
the Lipschitz constant for $g$ on the set $S(R_0)$, we have

$$\|g(x) - g_m\| \le \beta_1 \|x - x_m\|,$$

for all $x \in S(R_0)$. We now define the scalars $\epsilon$ and $R$ to satisfy

$$\epsilon = \frac{1}{2} \|g_m\|, \qquad R = \min\left(\frac{\epsilon}{\beta_1}, R_0\right).$$

Note that the ball

$$\mathcal{B}(x_m, R) = \{x \mid \|x - x_m\| \le R\}$$

is contained in $S(R_0)$, so Lipschitz continuity of $g$ holds inside $\mathcal{B}(x_m, R)$. We have

$$x \in \mathcal{B}(x_m, R) \ \Rightarrow\ \|g(x)\| \ge \|g_m\| - \|g(x) - g_m\| \ge \frac{1}{2} \|g_m\| = \epsilon.$$

If the entire sequence $\{x_k\}_{k \ge m}$ stays inside the ball $\mathcal{B}(x_m, R)$, we would have $\|g_k\| \ge \epsilon > 0$
for all $k \ge m$. The reasoning in the proof of Theorem 4.5 can be used to show that this
scenario does not occur. Therefore, the sequence $\{x_k\}_{k \ge m}$ eventually leaves $\mathcal{B}(x_m, R)$.

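Let the index $l \ge m$ be such that $x_{l+1}$ is the first iterate after $x_m$ outside $\mathcal{B}(x_m, R)$.
Since $\|g_k\| \ge \epsilon$ for $k = m, m+1, \dots, l$, we can use (4.29) to write

$$\begin{aligned}
f(x_m) - f(x_{l+1}) &= \sum_{k=m}^{l} \left[ f(x_k) - f(x_{k+1}) \right] \\
&\ge \sum_{k=m,\ x_k \ne x_{k+1}}^{l} \eta \left[ m_k(0) - m_k(p_k) \right] \\
&\ge \sum_{k=m,\ x_k \ne x_{k+1}}^{l} \eta c_1 \epsilon \min\left(\Delta_k, \frac{\epsilon}{\beta}\right),
\end{aligned}$$

where we have limited the sum to the iterations $k$ for which $x_k \ne x_{k+1}$, that is, those iterations
on which a step was actually taken. If $\Delta_k \le \epsilon/\beta$ for all $k = m, m+1, \dots, l$, we have

$$f(x_m) - f(x_{l+1}) \ge \eta c_1 \epsilon \sum_{k=m,\ x_k \ne x_{k+1}}^{l} \Delta_k \ge \eta c_1 \epsilon R = \eta c_1 \epsilon \min\left(\frac{\epsilon}{\beta_1}, R_0\right). \tag{4.34}$$

Otherwise, we have $\Delta_k > \epsilon/\beta$ for some $k = m, m+1, \dots, l$, and so

$$f(x_m) - f(x_{l+1}) \ge \eta c_1 \epsilon \frac{\epsilon}{\beta}. \tag{4.35}$$

Since the sequence $\{f(x_k)\}_{k=0}^{\infty}$ is decreasing and bounded below, we have that

$$f(x_k) \downarrow f^*$$

for some $f^* > -\infty$. Therefore, using (4.34) and (4.35), we can write

$$\begin{aligned}
f(x_m) - f^* &\ge f(x_m) - f(x_{l+1}) \\
&\ge \eta c_1 \epsilon \min\left(\frac{\epsilon}{\beta}, \frac{\epsilon}{\beta_1}, R_0\right) \\
&= \frac{1}{2} \eta c_1 \|g_m\| \min\left(\frac{\|g_m\|}{2\beta}, \frac{\|g_m\|}{2\beta_1}, R_0\right) > 0.
\end{aligned} \tag{4.36}$$

Since $f(x_m) - f^* \downarrow 0$, we must have $g_m \to 0$, giving the result. □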


4.3  ITERATIVE  SOLUTION  OF  THE  SUBPROBLEM 


In this section, we describe a technique that uses the characterization (4.6) of the subproblem
solution, applying Newton's method to find the value of $\lambda$ which matches the given
trust-region radius $\Delta$ in (4.5). We also prove the key result Theorem 4.1 concerning the
characterization of solutions of (4.3).

The  methods  of  Section  4.1  make  no  serious  attempt  to  find  the  exact  solution  of 
the  subproblem  (4.5).  They  do,  however,  make  some  use  of  the  information  in  the  model 
Hessian  B and  they  have  advantages  of  reasonable  implementation  cost  and  nice  global 
convergence  properties. 

When the problem is relatively small (that is, $n$ is not too large), it may be worthwhile
to exploit the model more fully by looking for a closer approximation to the solution of the
subproblem. In this section, we describe an approach for finding a good approximation at the
cost of a few factorizations of the matrix $B$ (typically three factorizations), as compared with
a single factorization for the dogleg and two-dimensional subspace minimization methods.
This approach is based on the characterization of the exact solution given in Theorem 4.1,
together with an ingenious application of Newton's method in one variable. Essentially, the
algorithm tries to identify the value of $\lambda$ for which (4.6) is satisfied by the solution of (4.5).

The characterization of Theorem 4.1 suggests an algorithm for finding the solution $p$
of (4.7). Either $\lambda = 0$ satisfies (4.8a) and (4.8c) with $\|p\| \le \Delta$, or else we define

$$p(\lambda) = -(B + \lambda I)^{-1} g$$

for $\lambda$ sufficiently large that $B + \lambda I$ is positive definite and seek a value $\lambda > 0$ such that

$$\|p(\lambda)\| = \Delta. \tag{4.37}$$

This problem is a one-dimensional root-finding problem in the variable $\lambda$.

To see that a value of $\lambda$ with all the desired properties exists, we appeal to the
eigendecomposition of $B$ and use it to study the properties of $\|p(\lambda)\|$. Since $B$ is symmetric, there
is an orthogonal matrix $Q$ and a diagonal matrix $\Lambda$ such that $B = Q \Lambda Q^T$, where

$$\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n),$$

and $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $B$; see (A.16). Clearly, $B + \lambda I = Q(\Lambda + \lambda I) Q^T$,
and for $\lambda \ne -\lambda_j$, we have

$$p(\lambda) = -Q (\Lambda + \lambda I)^{-1} Q^T g = -\sum_{j=1}^{n} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j, \tag{4.38}$$

where $q_j$ denotes the $j$th column of $Q$. Therefore, by orthonormality of $q_1, q_2, \dots, q_n$, we
have

$$\|p(\lambda)\|^2 = \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2}. \tag{4.39}$$

This expression tells us a lot about $\|p(\lambda)\|$. If $\lambda > -\lambda_1$, we have $\lambda_j + \lambda > 0$ for all
$j = 1, 2, \dots, n$, and so $\|p(\lambda)\|$ is a continuous, nonincreasing function of $\lambda$ on the interval
$(-\lambda_1, \infty)$. In fact, we have that

$$\lim_{\lambda \to \infty} \|p(\lambda)\| = 0. \tag{4.40}$$

Moreover, we have when $q_j^T g \ne 0$ that

$$\lim_{\lambda \to -\lambda_j} \|p(\lambda)\| = \infty. \tag{4.41}$$

Figure 4.5 plots $\|p(\lambda)\|$ against $\lambda$ in a case in which $q_1^T g$, $q_2^T g$, and $q_3^T g$ are all nonzero.
Note that the properties (4.40) and (4.41) hold and that $\|p(\lambda)\|$ is a nonincreasing function
of $\lambda$ on $(-\lambda_1, \infty)$. In particular, as is always the case when $q_1^T g \ne 0$, there is a unique
value $\lambda^* \in (-\lambda_1, \infty)$ such that $\|p(\lambda^*)\| = \Delta$. (There may be other, smaller values of $\lambda$ for
which $\|p(\lambda)\| = \Delta$, but these will fail to satisfy (4.8c).)
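The behavior of $\|p(\lambda)\|$ described above can be explored with a short sketch. It assumes that the eigenvalues $\lambda_j$ and the coefficients $q_j^T g$ are already available (for a diagonal $B$ they are simply the diagonal entries and the components of $g$), and it locates $\lambda^*$ by plain bisection rather than by the Newton iteration developed in this section. The function names are hypothetical.

```python
import math

def p_norm(lam, eigs, qTg):
    """||p(lambda)|| evaluated from (4.39)."""
    return math.sqrt(sum((c / (lj + lam)) ** 2 for lj, c in zip(eigs, qTg)))

def find_lambda_bisect(eigs, qTg, delta):
    """Bisection for ||p(lambda)|| = delta on (-lambda_1, inf).

    Assumes q_1^T g != 0, so that ||p(lambda)|| blows up as lambda
    approaches -lambda_1 from above and a root is guaranteed to exist.
    """
    lo = -min(eigs)              # ||p|| is unbounded just to the right of lo
    step = 1.0
    hi = lo + step
    while p_norm(hi, eigs, qTg) > delta:   # expand until ||p(hi)|| <= delta
        step *= 2.0
        hi = lo + step
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:         # interval can no longer be split
            break
        if p_norm(mid, eigs, qTg) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eigs = [-2.0, 1.0, 3.0]          # lambda_1 = -2 is the most negative eigenvalue
qTg = [1.0, 1.0, 1.0]            # all coefficients nonzero
lam_star = find_lambda_bisect(eigs, qTg, delta=1.0)
print(lam_star, p_norm(lam_star, eigs, qTg))
```

This sketch does not test whether $\lambda = 0$ already satisfies (4.8); that check has to come first when $B$ is positive definite.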

We now sketch a procedure for identifying the $\lambda^* \in (-\lambda_1, \infty)$ for which $\|p(\lambda^*)\| = \Delta$,
which works when $q_1^T g \ne 0$. (We discuss the case of $q_1^T g = 0$ later.) First, note that when $B$
is positive definite and $\|B^{-1} g\| \le \Delta$, the value $\lambda = 0$ satisfies (4.8), so the procedure can be
terminated immediately with $\lambda^* = 0$. Otherwise, we could use the root-finding Newton's
method (see the Appendix) to find the value of $\lambda > -\lambda_1$ that solves

$$\phi_1(\lambda) = \|p(\lambda)\| - \Delta = 0. \tag{4.42}$$

The disadvantage of this approach can be seen by considering the form of $\|p(\lambda)\|$ when $\lambda$
is greater than, but close to, $-\lambda_1$. For such $\lambda$, we can approximate $\phi_1$ by a rational function,
as follows:

$$\phi_1(\lambda) \approx \frac{C_1}{\lambda + \lambda_1} + C_2,$$

where $C_1 > 0$ and $C_2$ are constants. Clearly this approximation (and hence $\phi_1$) is highly
nonlinear, so the root-finding Newton's method will be unreliable or slow. Better results will
be obtained if we reformulate the problem (4.42) so that it is nearly linear near the optimal
$\lambda$. By defining

$$\phi_2(\lambda) = \frac{1}{\Delta} - \frac{1}{\|p(\lambda)\|},$$

it can be shown using (4.39) that for $\lambda$ slightly greater than $-\lambda_1$, we have

$$\phi_2(\lambda) \approx \frac{1}{\Delta} - \frac{\lambda + \lambda_1}{C_3}$$

for some $C_3 > 0$. Hence, $\phi_2$ is nearly linear near $-\lambda_1$ (see Figure 4.6), and the root-finding
Newton's method will perform well, provided that it maintains $\lambda > -\lambda_1$. The root-finding
Newton's method applied to $\phi_2$ generates a sequence of iterates $\lambda^{(\ell)}$ by setting

$$\lambda^{(\ell+1)} = \lambda^{(\ell)} - \frac{\phi_2\left(\lambda^{(\ell)}\right)}{\phi_2'\left(\lambda^{(\ell)}\right)}. \tag{4.43}$$

Figure 4.6  $1/\|p(\lambda)\|$ as a function of $\lambda$.


After  some  elementary  manipulation,  this  updating  formula  can  be  implemented  in  the 
following  practical  way. 


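Algorithm 4.3 (Trust Region Subproblem).
Given $\lambda^{(0)}$, $\Delta > 0$:
for $\ell = 0, 1, 2, \dots$
    Factor $B + \lambda^{(\ell)} I = R^T R$;
    Solve $R^T R p_\ell = -g$, $R^T q_\ell = p_\ell$;
    Set
    $$\lambda^{(\ell+1)} = \lambda^{(\ell)} + \left(\frac{\|p_\ell\|}{\|q_\ell\|}\right)^2 \left(\frac{\|p_\ell\| - \Delta}{\Delta}\right); \tag{4.44}$$
end (for).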

Safeguards must be added to this algorithm to make it practical; for instance, when
$\lambda^{(\ell)} < -\lambda_1$, the Cholesky factorization $B + \lambda^{(\ell)} I = R^T R$ will not exist. A slightly enhanced
version of this algorithm does, however, converge to a solution of (4.37) in most cases.

The main work in each iteration of this method is, of course, the Cholesky factorization
of $B + \lambda^{(\ell)} I$. Practical versions of this algorithm do not iterate until convergence to the
optimal $\lambda$ is obtained with high accuracy, but are content with an approximate solution that
can be obtained in two or three iterations.
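The iteration of Algorithm 4.3 can be sketched in a few dozen lines. The version below is an illustration only: it uses dense lists and a textbook Cholesky routine, it assumes that the starting value `lam0` makes $B + \lambda^{(0)} I$ positive definite, and its safeguard (halving the distance back to the last value of $\lambda$ that could be factored) is a simple stand-in chosen here, not the safeguarding used in practical codes. It also does not handle the case in which $\lambda = 0$ already satisfies (4.8), nor the hard case. All function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def shifted(B, lam):
    """Return B + lam * I as a new list of rows."""
    n = len(B)
    return [[B[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]

def cholesky_lower(A):
    """Lower-triangular L with L L^T = A, or None if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub_transposed(L, y):
    """Solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def solve_subproblem(B, g, delta, lam0, max_iter=50, tol=1e-10):
    """Newton iteration (4.44) on lambda; here R = L^T, so R^T R = L L^T."""
    lam = lam0
    L = cholesky_lower(shifted(B, lam))
    if L is None:
        raise ValueError("B + lam0 * I must be positive definite")
    neg_g = [-gi for gi in g]
    for _ in range(max_iter):
        p = backward_sub_transposed(L, forward_sub(L, neg_g))   # R^T R p = -g
        pn = norm(p)
        if abs(pn - delta) <= tol * delta:
            return lam, p
        q = forward_sub(L, p)                                    # R^T q = p
        trial = lam + (pn / norm(q)) ** 2 * (pn - delta) / delta  # update (4.44)
        L_trial = cholesky_lower(shifted(B, trial))
        backoffs = 0
        while L_trial is None:       # illustrative safeguard: back off toward lam
            backoffs += 1
            if backoffs > 60:
                raise RuntimeError("safeguard failed to restore positive definiteness")
            trial = 0.5 * (trial + lam)
            L_trial = cholesky_lower(shifted(B, trial))
        lam, L = trial, L_trial
    p = backward_sub_transposed(L, forward_sub(L, neg_g))
    return lam, p

B = [[1.0, 2.0], [2.0, 1.0]]     # eigenvalues -1 and 3
g = [1.0, 0.5]
lam, p = solve_subproblem(B, g, delta=1.0, lam0=1.5)
print(lam, p, norm(p))
```

The loop runs to a tight tolerance so that the result is easy to check; a practical code would stop after two or three iterations, as noted above.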

THE  HARD  CASE 

Recall that in the discussion above, we assumed that $q_1^T g \ne 0$. In fact, the approach
described above can be applied even when the most negative eigenvalue is a multiple
eigenvalue (that is, $0 > \lambda_1 = \lambda_2 = \cdots$), provided that $Q_1^T g \ne 0$, where $Q_1$ is the matrix
whose columns span the subspace corresponding to the eigenvalue $\lambda_1$. When this condition
does not hold, the situation becomes a little complicated, because the limit (4.41) does not
hold for $\lambda_j = \lambda_1$ and so there may not be a value $\lambda \in (-\lambda_1, \infty)$ such that $\|p(\lambda)\| = \Delta$ (see
Figure 4.7). Moré and Sorensen [214] refer to this case as the hard case. At first glance, it is
not clear how $p$ and $\lambda$ can be chosen to satisfy (4.8) in the hard case. Clearly, our root-finding
technique will not work, since there is no solution for $\lambda$ in the open interval $(-\lambda_1, \infty)$. But
Theorem 4.1 assures us that the right value of $\lambda$ lies in the interval $[-\lambda_1, \infty)$, so there is only
one possibility: $\lambda = -\lambda_1$. To find $p$, it is not enough to delete the terms for which $\lambda_j = \lambda_1$
from the formula (4.38) and set

$$p = -\sum_{j:\, \lambda_j \ne \lambda_1} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j.$$

Instead, we note that $(B - \lambda_1 I)$ is singular, so there is a vector $z$ such that $\|z\| = 1$ and
$(B - \lambda_1 I) z = 0$. In fact, $z$ is an eigenvector of $B$ corresponding to the eigenvalue $\lambda_1$, so by
orthogonality of $Q$ we have $q_j^T z = 0$ for $\lambda_j \ne \lambda_1$. It follows from this property that if we set

$$p = -\sum_{j:\, \lambda_j \ne \lambda_1} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j + \tau z \tag{4.45}$$

for any scalar $\tau$, we have

$$\|p\|^2 = \sum_{j:\, \lambda_j \ne \lambda_1} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2} + \tau^2,$$

so it is always possible to choose $\tau$ to ensure that $\|p\| = \Delta$. It is easy to check that the
conditions (4.8) hold for this choice of $p$ and $\lambda = -\lambda_1$.
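The construction (4.45) is easy to carry out when $B$ is diagonal, because the eigenvectors are then the coordinate vectors. The sketch below makes exactly that assumption (it is not a general implementation), takes $z$ to be the first coordinate vector belonging to $\lambda_1$, and uses the hypothetical name `hard_case_step`.

```python
import math

def hard_case_step(eigs, g, delta):
    """Hard-case step (4.45) for B = diag(eigs), so that q_j = e_j (assumption)."""
    lam1 = min(eigs)
    if lam1 >= 0.0:
        raise ValueError("this sketch expects a negative most-negative eigenvalue")
    idx1 = [j for j, lj in enumerate(eigs) if lj == lam1]
    if any(g[j] != 0.0 for j in idx1):
        raise ValueError("not the hard case: g has a component in the lambda_1 eigenspace")
    lam = -lam1
    # Terms of (4.38) with lambda_j != lambda_1, evaluated at lambda = -lambda_1.
    p = [0.0 if j in idx1 else -g[j] / (eigs[j] + lam) for j in range(len(eigs))]
    base_sq = sum(pj * pj for pj in p)
    if base_sq > delta ** 2:
        raise ValueError("||p(lambda)|| = delta has a root in (-lambda_1, inf)")
    tau = math.sqrt(delta ** 2 - base_sq)
    p[idx1[0]] += tau            # add tau * z with z = e_{idx1[0]}
    return lam, p, tau

eigs = [-2.0, 1.0, 3.0]
g = [0.0, 1.0, 1.0]              # q_1^T g = 0: the hard case
lam, p, tau = hard_case_step(eigs, g, delta=1.0)
print(lam, p, tau)
```

Either sign of $\tau$ gives a step of length $\Delta$; the sketch returns the positive root.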

PROOF OF THEOREM 4.1

We  now  give  a  formal  proof  of  Theorem  4.1,  the  result  that  characterizes  the  exact 
solution  of  (4.5).  The  proof  relies  on  the  following  technical  lemma,  which  deals  with  the 
unconstrained  minimizers  of  quadratics  and  is  particularly  interesting  in  the  case  where  the 
Hessian  is  positive  semidefinite. 

Lemma  4.7. 

Let $m$ be the quadratic function defined by

$$m(p) = g^T p + \frac{1}{2} p^T B p, \tag{4.46}$$

where $B$ is any symmetric matrix. Then the following statements are true.

(i) $m$ attains a minimum if and only if $B$ is positive semidefinite and $g$ is in the range of $B$.
If $B$ is positive semidefinite, then every $p$ satisfying $Bp = -g$ is a global minimizer of $m$.

(ii) $m$ has a unique minimizer if and only if $B$ is positive definite.

Proof. We prove each of the three claims in turn.

(i) We start by proving the “if” part. Since $g$ is in the range of $B$, there is a $p$ with $Bp = -g$.
For all $w \in \mathbb{R}^n$, we have

$$\begin{aligned}
m(p + w) &= g^T (p + w) + \frac{1}{2} (p + w)^T B (p + w) \\
&= \left(g^T p + \frac{1}{2} p^T B p\right) + g^T w + (Bp)^T w + \frac{1}{2} w^T B w \\
&= m(p) + \frac{1}{2} w^T B w \\
&\ge m(p),
\end{aligned} \tag{4.47}$$

since $B$ is positive semidefinite. Hence, $p$ is a minimizer of $m$.

For the “only if” part, let $p$ be a minimizer of $m$. Since $\nabla m(p) = Bp + g = 0$, we
have that $g$ is in the range of $B$. Also, we have $\nabla^2 m(p) = B$ positive semidefinite, giving
the result.

(ii) For the “if” part, the same argument as in (i) suffices with the additional point that
$w^T B w > 0$ whenever $w \ne 0$. For the “only if” part, we proceed as in (i) to deduce that $B$ is
positive semidefinite. If $B$ is not positive definite, there is a vector $w \ne 0$ such that $Bw = 0$.
Hence, from (4.47), we have $m(p + w) = m(p)$, so the minimizer is not unique, giving a
contradiction. □

To illustrate case (i), suppose that

$$B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 2 \end{bmatrix},$$

which has eigenvalues 0, 1, 2 and is therefore singular. If $g$ is any vector whose second
component is zero, then $g$ will be in the range of $B$, and the quadratic will attain a minimum.
But if the second element in $g$ is nonzero, we can decrease $m(\cdot)$ indefinitely by moving along
the direction $(0, -g_2, 0)^T$.

We are now in a position to take account of the trust-region bound $\|p\| \le \Delta$ and
hence prove Theorem 4.1.
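The example for case (i) can be reproduced directly. In the sketch below the two vectors `g_in_range` and `g_bad` are illustrative choices, not taken from the text, and the matrix is stored by its diagonal only.

```python
def m(p, g, b):
    """m(p) = g^T p + 0.5 p^T B p for the diagonal matrix B = diag(b)."""
    linear = sum(gi * pi for gi, pi in zip(g, p))
    quadratic = 0.5 * sum(bi * pi * pi for bi, pi in zip(b, p))
    return linear + quadratic

b = [1.0, 0.0, 2.0]              # B = diag(1, 0, 2)

# g in the range of B (second component zero): a minimum is attained,
# and the second component of the minimizer is arbitrary.
g_in_range = [1.0, 0.0, -4.0]
p_star = [-1.0, 0.0, 2.0]        # solves B p = -g
print(m(p_star, g_in_range, b), m([-1.0, 7.0, 2.0], g_in_range, b))

# g with nonzero second component: m is unbounded below along (0, -g_2, 0)^T.
g_bad = [1.0, 3.0, -4.0]
for alpha in (1.0, 10.0, 100.0):
    direction = [0.0, -alpha * g_bad[1], 0.0]
    print(alpha, m(direction, g_bad, b))
```

Along the direction $(0, -g_2, 0)^T$ the quadratic term vanishes, so the model decreases linearly at the rate $g_2^2$.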

PROOF. (Theorem 4.1)

Assume first that there is $\lambda \ge 0$ such that the conditions (4.8) are satisfied.
Lemma 4.7(i) implies that $p^*$ is a global minimum of the quadratic function

$$\hat{m}(p) = g^T p + \frac{1}{2} p^T (B + \lambda I) p = m(p) + \frac{\lambda}{2} p^T p. \tag{4.48}$$

Since $\hat{m}(p) \ge \hat{m}(p^*)$, we have

$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left((p^*)^T p^* - p^T p\right). \tag{4.49}$$

Because $\lambda(\Delta - \|p^*\|) = 0$ and therefore $\lambda(\Delta^2 - (p^*)^T p^*) = 0$, we have

$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left(\Delta^2 - p^T p\right).$$

Hence, from $\lambda \ge 0$, we have $m(p) \ge m(p^*)$ for all $p$ with $\|p\| \le \Delta$. Therefore, $p^*$ is a
global minimizer of (4.7).

For the converse, we assume that $p^*$ is a global solution of (4.7) and show that there
is a $\lambda \ge 0$ that satisfies (4.8).

In the case $\|p^*\| < \Delta$, $p^*$ is an unconstrained minimizer of $m$, and so

$$\nabla m(p^*) = B p^* + g = 0, \qquad \nabla^2 m(p^*) = B \ \text{positive semidefinite},$$

and so the properties (4.8) hold for $\lambda = 0$.

Assume for the remainder of the proof that $\|p^*\| = \Delta$. Then (4.8b) is immediately
satisfied, and $p^*$ also solves the constrained problem

$$\min\ m(p) \quad \text{subject to } \|p\| = \Delta.$$

By applying optimality conditions for constrained optimization to this problem (see
(12.34)), we find that there is a $\lambda$ such that the Lagrangian function defined by

$$\mathcal{L}(p, \lambda) = m(p) + \frac{\lambda}{2} \left(p^T p - \Delta^2\right)$$

has a stationary point at $p^*$. By setting $\nabla_p \mathcal{L}(p^*, \lambda)$ to zero, we obtain

$$B p^* + g + \lambda p^* = 0 \ \Rightarrow\ (B + \lambda I) p^* = -g, \tag{4.50}$$

so that (4.8a) holds. Since $m(p) \ge m(p^*)$ for any $p$ with $p^T p = (p^*)^T p^* = \Delta^2$, we have
for such vectors $p$ that

$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left((p^*)^T p^* - p^T p\right).$$

If we substitute the expression for $g$ from (4.50) into this expression, we obtain after some
rearrangement that

$$\frac{1}{2} (p - p^*)^T (B + \lambda I)(p - p^*) \ge 0. \tag{4.51}$$

Since the set of directions

$$\left\{ w : w = \pm \frac{p - p^*}{\|p - p^*\|}, \ \text{for some } p \text{ with } \|p\| = \Delta \right\}$$

is dense on the unit sphere, (4.51) suffices to prove (4.8c).

It remains to show that $\lambda \ge 0$. Because (4.8a) and (4.8c) are satisfied by $p^*$, we have
from Lemma 4.7(i) that $p^*$ minimizes $\hat{m}$, so (4.49) holds. Suppose that there are only negative
values of $\lambda$ that satisfy (4.8a) and (4.8c). Then we have from (4.49) that $m(p) \ge m(p^*)$
whenever $\|p\| \ge \|p^*\| = \Delta$. Since we already know that $p^*$ minimizes $m$ for $\|p\| \le \Delta$,
it follows that $p^*$ is in fact a global, unconstrained minimizer of $m$. From Lemma 4.7(i) it
follows that $B p^* = -g$ and $B$ is positive semidefinite. Therefore conditions (4.8a) and (4.8c)
are satisfied by $\lambda = 0$, which contradicts our assumption that only negative values of $\lambda$ can
satisfy the conditions. We conclude that $\lambda \ge 0$, completing the proof. □


CONVERGENCE  OF  ALGORITHMS  BASED  ON  NEARLY  EXACT  SOLUTIONS 

As we noted in the discussion of Algorithm 4.3, the loop to determine the optimal
values of $\lambda$ and $p$ for the subproblem (4.5) does not iterate until high accuracy is achieved.
Instead, it is terminated after two or three iterations with a fairly loose approximation to
the true solution. The inexactness in this approximate solution is measured in a different
way from the dogleg and subspace minimization algorithms. We can add safeguards to the
root-finding Newton method to ensure that the key assumptions of Theorems 4.5 and 4.6
are satisfied by the approximate solution. Specifically, we require that

$$m(0) - m(p) \ge c_1 \left(m(0) - m(p^*)\right), \tag{4.52a}$$
$$\|p\| \le \gamma \Delta \tag{4.52b}$$

(where $p^*$ is the exact solution of (4.3)), for some constants $c_1 \in (0, 1]$ and $\gamma > 0$. The
condition (4.52a) ensures that the approximate solution achieves a significant fraction of the
maximum decrease possible in the model function $m$. (It is not necessary to know $p^*$; there
are practical termination criteria that imply (4.52a).) One major difference between (4.52)
and the earlier criterion (4.20) is that (4.52) makes better use of the second-order part of
$m(\cdot)$, that is, the $p^T B p$ term. This difference is illustrated by the case in which $g = 0$ while
$B$ has negative eigenvalues, indicating that the current iterate $x_k$ is a saddle point. Here,
the right-hand side of (4.20) is zero (indeed, the algorithms we described earlier would
terminate at such a point). The right-hand side of (4.52) is positive, indicating that decrease
in the model function is still possible, so it forces the algorithm to move away from $x_k$.
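A two-variable saddle point shows the difference between the two criteria. The sketch below assumes a diagonal model Hessian with one negative eigenvalue and $g = 0$; the names `rhs_420` and `model_decrease`, and the particular numbers, are illustrative choices. The exact solution used here, a step of length $\Delta$ along the eigenvector of the negative eigenvalue, satisfies (4.8) with $\lambda = 2$.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def model_decrease(g, b, p):
    """m(0) - m(p) for the model with Hessian diag(b)."""
    return -(sum(gi * pi for gi, pi in zip(g, p))
             + 0.5 * sum(bi * pi * pi for bi, pi in zip(b, p)))

def rhs_420(g, b, delta, c1=0.5):
    """Right-hand side of (4.20); ||B|| is max |b_i| for a diagonal matrix."""
    gnorm = norm(g)
    return c1 * gnorm * min(delta, gnorm / max(abs(bi) for bi in b))

g = [0.0, 0.0]                   # stationary point
b = [1.0, -2.0]                  # indefinite Hessian: a saddle
delta = 1.0

p_exact = [0.0, delta]           # boundary step along the negative-curvature direction
print(rhs_420(g, b, delta))               # Cauchy-based bound
print(model_decrease(g, b, p_exact))      # decrease achieved by the exact solution
```

Because the bound from (4.20) vanishes, a Cauchy-point method has no reason to move, while any step satisfying (4.52a) must reduce the model by a fixed fraction of the second printed value.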

The close attention that near-exact algorithms pay to the second-order term is warranted
only if this term closely reflects the actual behavior of the function $f$; in fact,
the trust-region Newton method, for which $B = \nabla^2 f(x)$, is the only case that has been
treated in the literature. For purposes of global convergence analysis, the use of the exact
Hessian allows us to say more about the limit points of the algorithm than merely that they
are stationary points. The following result shows that second-order necessary conditions
(Theorem 2.3) are satisfied at the limit points.

Theorem  4.8. 

Suppose that the assumptions of Theorem 4.6 are satisfied and in addition that $f$ is twice
continuously differentiable in the level set $S$. Suppose that $B_k = \nabla^2 f(x_k)$ for all $k$, and that the
approximate solution $p_k$ of (4.3) at each iteration satisfies (4.52) for some fixed $\gamma > 0$. Then
$\lim_{k \to \infty} \|g_k\| = 0$.

If, in addition, the level set $S$ of (4.24) is compact, then either the algorithm terminates
at a point $x_k$ at which the second-order necessary conditions (Theorem 2.3) for a local solution
hold, or else $\{x_k\}$ has a limit point $x^*$ in $S$ at which the second-order necessary conditions hold.

We omit the proof, which can be found in Moré and Sorensen [214, Section 4].


4.4  LOCAL  CONVERGENCE  OF  TRUST-REGION  NEWTON 
METHODS 


Since global convergence of trust-region methods that use exact Hessians $\nabla^2 f(x)$ is established
above, we turn our attention now to local convergence issues. The key to attaining
the fast rate of convergence usually associated with Newton's method is to show that the
trust-region bound eventually does not interfere as we approach a solution. Specifically, we
hope that near the solution, the (approximate) solution of the trust-region subproblem is
well inside the trust region and becomes closer and closer to the true Newton step. Steps
that satisfy the latter property are said to be asymptotically similar to Newton steps.

We first prove a general result that applies to any algorithm of the form of Algorithm 4.1
(see Chapter 4) that generates steps that are asymptotically similar to Newton
steps whenever the Newton steps easily satisfy the trust-region bound. It shows that the
trust-region constraint eventually becomes inactive in algorithms with this property and
that superlinear convergence can be attained. The result assumes that the exact Hessian
$B_k = \nabla^2 f(x_k)$ is used in (4.3) when $x_k$ is close to a solution $x^*$ that satisfies second-order
sufficient conditions (see Theorem 2.4). Moreover, it assumes that the algorithm uses an
approximate solution $p_k$ of (4.3) that achieves a similar decrease in the model function $m_k$
as the Cauchy point.

Theorem  4.9. 

Let $f$ be twice Lipschitz continuously differentiable in a neighborhood of a point $x^*$ at
which second-order sufficient conditions (Theorem 2.4) are satisfied. Suppose the sequence $\{x_k\}$
converges to $x^*$ and that for all $k$ sufficiently large, the trust-region algorithm based on (4.3)
with $B_k = \nabla^2 f(x_k)$ chooses steps $p_k$ that satisfy the Cauchy-point-based model reduction
criterion (4.20) and are asymptotically similar to Newton steps $p_k^N$ whenever $\|p_k^N\| \le \frac{1}{2} \Delta_k$,
that is,

$$\|p_k - p_k^N\| = o(\|p_k^N\|). \tag{4.53}$$

Then the trust-region bound $\Delta_k$ becomes inactive for all $k$ sufficiently large and the sequence
$\{x_k\}$ converges superlinearly to $x^*$.

Proof. We show that $\|p_k^N\| \le \tfrac12 \Delta_k$ and $\|p_k\| \le \Delta_k$ for all sufficiently large $k$, so the near-optimal step $p_k$ in (4.53) will eventually always be taken.

We first seek a lower bound on the predicted reduction $m_k(0) - m_k(p_k)$ for all sufficiently large $k$. We assume that $k$ is large enough that the $o(\|p_k^N\|)$ term in (4.53) is less than $\|p_k^N\|$. When $\|p_k^N\| \le \tfrac12 \Delta_k$, we then have that $\|p_k\| \le \|p_k^N\| + o(\|p_k^N\|) \le 2\|p_k^N\|$, while if $\|p_k^N\| > \tfrac12 \Delta_k$, we have $\|p_k\| \le \Delta_k < 2\|p_k^N\|$. In both cases, then, we have

$$\|p_k\| \le 2\|p_k^N\| \le 2\,\|\nabla^2 f(x_k)^{-1}\|\,\|g_k\|,$$

and so $\|g_k\| \ge \tfrac12 \|p_k\| / \|\nabla^2 f(x_k)^{-1}\|$.

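We have from the relation (4.20) that

$$
\begin{aligned}
m_k(0) - m_k(p_k)
&\ge c_1 \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|\nabla^2 f(x_k)\|} \right) \\
&\ge c_1 \frac{\|p_k\|}{2\|\nabla^2 f(x_k)^{-1}\|} \min\left( \|p_k\|, \frac{\|p_k\|}{2\|\nabla^2 f(x_k)\|\,\|\nabla^2 f(x_k)^{-1}\|} \right) \\
&= c_1 \frac{\|p_k\|^2}{4\|\nabla^2 f(x_k)^{-1}\|^2\,\|\nabla^2 f(x_k)\|}.
\end{aligned}
$$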

Because $x_k \to x^*$, we use continuity of $\nabla^2 f(x)$ and positive definiteness of $\nabla^2 f(x^*)$ to deduce that the following bound holds for all $k$ sufficiently large:

$$\frac{c_1}{4\|\nabla^2 f(x_k)^{-1}\|^2\,\|\nabla^2 f(x_k)\|} \ge \frac{c_1}{8\|\nabla^2 f(x^*)^{-1}\|^2\,\|\nabla^2 f(x^*)\|} \stackrel{\rm def}{=} c_3,$$

where $c_3 > 0$. Hence, we have

$$m_k(0) - m_k(p_k) \ge c_3 \|p_k\|^2 \qquad (4.54)$$

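for all sufficiently large $k$. By Lipschitz continuity of $\nabla^2 f(x)$ near $x^*$, and using Taylor's theorem (Theorem 2.1), we have

$$
\begin{aligned}
\bigl| (f(x_k) - f(x_k + p_k)) - (m_k(0) - m_k(p_k)) \bigr|
&= \left| \tfrac12 p_k^T \nabla^2 f(x_k) p_k - \tfrac12 \int_0^1 p_k^T \nabla^2 f(x_k + t p_k) p_k \, dt \right| \\
&\le \frac{L}{4} \|p_k\|^3,
\end{aligned}
$$

where $L > 0$ is the Lipschitz constant for $\nabla^2 f(\cdot)$. Hence, by definition (4.4) of $\rho_k$, we have for sufficiently large $k$ that

$$|\rho_k - 1| \le \frac{\|p_k\|^3 (L/4)}{c_3 \|p_k\|^2} = \frac{L}{4 c_3} \|p_k\| \le \frac{L}{4 c_3} \Delta_k. \qquad (4.55)$$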


Now, the trust-region radius can be reduced only if $\rho_k < \tfrac14$ (or some other fixed number less than 1), so it is clear from (4.55) that the sequence $\{\Delta_k\}$ is bounded away from zero. Since $x_k \to x^*$, we have $\|p_k^N\| \to 0$ and therefore $\|p_k\| \to 0$ from (4.53). Hence, the trust-region bound is inactive for all $k$ sufficiently large, and the bound $\|p_k^N\| \le \tfrac12 \Delta_k$ is eventually always satisfied.

To prove superlinear convergence, we use the quadratic convergence of Newton's method, proved in Theorem 3.5. In particular, we have from (3.33) that

$$\|x_k + p_k^N - x^*\| = O(\|x_k - x^*\|^2),$$

which implies that $\|p_k^N\| = O(\|x_k - x^*\|)$. Therefore, using (4.53), we have

$$\|x_k + p_k - x^*\| \le \|x_k + p_k^N - x^*\| + \|p_k^N - p_k\| = O(\|x_k - x^*\|^2) + o(\|p_k^N\|) = o(\|x_k - x^*\|),$$

thus proving superlinear convergence. □


It is immediate from Theorem 3.5 that if $p_k = p_k^N$ for all $k$ sufficiently large, we have quadratic convergence of $\{x_k\}$ to $x^*$.

Reasonable implementations of the dogleg, subspace minimization, and nearly exact algorithm of Section 4.3 with $B_k = \nabla^2 f(x_k)$ eventually use the steps $p_k = p_k^N$ under the conditions of Theorem 4.9, and therefore converge quadratically. In the case of the dogleg and two-dimensional subspace minimization methods, the exact step $p_k^N$ is one of the candidates for $p_k$: it lies inside the trust region, along the dogleg path, and inside the two-dimensional subspace. Since under the assumptions of Theorem 4.9, $p_k^N$ is the unconstrained minimizer of $m_k$ for $k$ sufficiently large, it is certainly the minimizer in the more restricted domains, so we have $p_k = p_k^N$. For the approach of Section 4.3, if we follow the reasonable strategy of checking whether $p_k^N$ is a solution of (4.3) prior to embarking on Algorithm 4.3, then eventually we will have $p_k = p_k^N$ as well.


4.5  OTHER  ENHANCEMENTS 

SCALING 

As we noted in Chapter 2, optimization problems are often posed with poor scaling: the objective function $f$ is highly sensitive to small changes in certain components of the vector $x$ and relatively insensitive to changes in other components. Topologically, a symptom of poor scaling is that the minimizer $x^*$ lies in a narrow valley, so that the contours of the objective $f(\cdot)$ near $x^*$ tend towards highly eccentric ellipses. Algorithms that fail to compensate for poor scaling can perform badly; see Figure 2.7 for an illustration of the poor performance of the steepest descent approach.

Recalling our definition of a trust region (a region around the current iterate within which the model $m_k(\cdot)$ is an adequate representation of the true objective $f(\cdot)$), it is easy to see that a spherical trust region may not be appropriate when $f$ is poorly scaled. Even if the model Hessian $B_k$ is exact, the rapid changes in $f$ along certain directions probably will cause $m_k$ to be a poor approximation to $f$ along these directions. On the other hand, $m_k$ may be a more reliable approximation to $f$ along directions in which $f$ is changing more slowly. Since the shape of our trust region should be such that our confidence in the model is more or less the same at all points on the boundary of the region, we are led naturally to consider elliptical trust regions in which the axes are short in the sensitive directions and longer in the less sensitive directions.

Elliptical trust regions can be defined by

$$\|Dp\| \le \Delta, \qquad (4.56)$$

where $D$ is a diagonal matrix with positive diagonal elements, yielding the following scaled trust-region subproblem:

$$\min_{p \in \mathbb{R}^n} \; m_k(p) = f_k + g_k^T p + \tfrac12 p^T B_k p \quad \text{s.t. } \|Dp\| \le \Delta_k. \qquad (4.57)$$

When $f(x)$ is highly sensitive to the value of the $i$th component $x_i$, we set the corresponding diagonal element $d_{ii}$ of $D$ to be large, while $d_{ii}$ is smaller for less-sensitive components.

Information to construct the scaling matrix $D$ may be derived from the second derivatives $\partial^2 f / \partial x_i^2$. We can allow $D$ to change from iteration to iteration; most of the theory of this chapter will still apply with minor modifications provided that each $d_{ii}$ stays within some predetermined range $[d_{lo}, d_{hi}]$, where $0 < d_{lo} \le d_{hi} < \infty$. Of course, we do not need $D$ to be a precise reflection of the scaling of the problem, so it is not necessary to devise elaborate heuristics or to perform extensive computations to get it just right.
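As a rough illustration of this idea, the sketch below builds the diagonal of $D$ from second-derivative information and clamps each entry to a fixed range. The specific rule $d_{ii} = \sqrt{|\partial^2 f/\partial x_i^2|}$, the function name `scaling_diagonal`, and the default bounds are assumptions made for this example; the text does not prescribe a particular formula.

```python
import math

def scaling_diagonal(hess_diag, d_lo=1e-3, d_hi=1e3):
    """Return the diagonal of D, clamped to [d_lo, d_hi].

    Assumed heuristic (not from the text): d_ii = sqrt(|d2f/dx_i^2|),
    so sensitive components get a large d_ii and a short ellipse axis.
    """
    return [min(max(math.sqrt(abs(h)), d_lo), d_hi) for h in hess_diag]

# f(x) = 1e6 * x1**2 + x2**2 has second derivatives 2e6 and 2:
# it is far more sensitive to x1 than to x2.
d = scaling_diagonal([2.0e6, 2.0])
print(d)
```

The clamping step is what keeps every $d_{ii}$ inside the predetermined range that the theory requires; here the first entry is cut off at the upper bound.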

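The following procedure shows how the Cauchy point calculation (Algorithm 4.2) changes when we use a scaled trust region.

Algorithm 4.4 (Generalized Cauchy Point Calculation).

Find the vector $p_k^s$ that solves

$$p_k^s = \arg\min_{p \in \mathbb{R}^n} \; f_k + g_k^T p \quad \text{s.t. } \|Dp\| \le \Delta_k; \qquad (4.58)$$

Calculate the scalar $\tau_k > 0$ that minimizes $m_k(\tau p_k^s)$ subject to satisfying the trust-region bound, that is,

$$\tau_k = \arg\min_{\tau > 0} \; m_k(\tau p_k^s) \quad \text{s.t. } \|\tau D p_k^s\| \le \Delta_k; \qquad (4.59)$$

$$p_k^C = \tau_k p_k^s.$$

For this scaled version, we find that

$$p_k^s = -\frac{\Delta_k}{\|D^{-1} g_k\|} D^{-2} g_k, \qquad (4.60)$$

and that the step length $\tau_k$ is obtained from the following modification of (4.12):

$$
\tau_k =
\begin{cases}
1 & \text{if } g_k^T D^{-2} B_k D^{-2} g_k \le 0, \\
\min\left( \dfrac{\|D^{-1} g_k\|^3}{\Delta_k \, g_k^T D^{-2} B_k D^{-2} g_k}, \; 1 \right) & \text{otherwise.}
\end{cases}
\qquad (4.61)
$$

(The details are left as an exercise.)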
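The formulas (4.60) and (4.61) can be sketched in a few lines for a diagonal $D$. This is a minimal illustration, not a library routine: the function name `generalized_cauchy_point`, the list-based vector representation, and the sample data are all assumptions of this example.

```python
import math

def generalized_cauchy_point(g, B, d, delta):
    """Cauchy point for the elliptical region ||D p|| <= delta, D = diag(d)."""
    n = len(g)
    dinv_g = [g[i] / d[i] for i in range(n)]              # D^{-1} g
    norm_dinv_g = math.sqrt(sum(v * v for v in dinv_g))
    w = [g[i] / d[i] ** 2 for i in range(n)]              # D^{-2} g
    p_s = [-delta / norm_dinv_g * w[i] for i in range(n)] # formula (4.60)
    Bw = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    curv = sum(w[i] * Bw[i] for i in range(n))            # g^T D^{-2} B D^{-2} g
    if curv <= 0.0:
        tau = 1.0                                         # step to the boundary
    else:
        tau = min(norm_dinv_g ** 3 / (delta * curv), 1.0) # formula (4.61)
    return [tau * v for v in p_s], tau

g = [4.0, 1.0]
B = [[100.0, 0.0], [0.0, 1.0]]
d = [10.0, 1.0]                       # x1 is the sensitive component
p_c, tau = generalized_cauchy_point(g, B, d, delta=2.0)
print(p_c, tau)
```

In this example the curvature along $-D^{-2}g_k$ is positive and the minimizer along that direction lies strictly inside the ellipse, so $\tau_k < 1$ and the trust-region bound is not active.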

A simpler alternative for adjusting the definition of the Cauchy point and the various algorithms of this chapter to allow for the elliptical trust region is simply to rescale the variables $p$ in the subproblem (4.57) so that the trust region is spherical in the scaled variables. By defining

$$\tilde p \stackrel{\rm def}{=} Dp,$$

and by substituting into (4.57), we obtain

$$\min_{\tilde p \in \mathbb{R}^n} \; \tilde m_k(\tilde p) = f_k + g_k^T D^{-1} \tilde p + \tfrac12 \tilde p^T D^{-1} B_k D^{-1} \tilde p \quad \text{s.t. } \|\tilde p\| \le \Delta_k.$$

The theory and algorithms can now be derived in the usual way by substituting $\tilde p$ for $p$, $D^{-1} g_k$ for $g_k$, $D^{-1} B_k D^{-1}$ for $B_k$, and so on.
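The substitution can be checked numerically: compute the ordinary Cauchy point for the spherical region in the scaled variables, then map back with $p = D^{-1}\tilde p$. The sketch below assumes a diagonal $D$ stored as a list `d`; the helper names `cauchy_point` and `scaled_cauchy_point` are hypothetical.

```python
import math

def cauchy_point(g, B, delta):
    """Standard Cauchy point for the spherical region ||p|| <= delta."""
    n = len(g)
    norm_g = math.sqrt(sum(v * v for v in g))
    Bg = [sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]
    curv = sum(g[i] * Bg[i] for i in range(n))            # g^T B g
    tau = 1.0 if curv <= 0.0 else min(norm_g ** 3 / (delta * curv), 1.0)
    return [-tau * delta / norm_g * v for v in g]

def scaled_cauchy_point(g, B, d, delta):
    """Solve in the scaled variables p~ = D p, then map back to p."""
    n = len(g)
    g_s = [g[i] / d[i] for i in range(n)]                              # D^{-1} g
    B_s = [[B[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]  # D^{-1} B D^{-1}
    p_tilde = cauchy_point(g_s, B_s, delta)
    return [p_tilde[i] / d[i] for i in range(n)]                       # p = D^{-1} p~

p = scaled_cauchy_point([4.0, 1.0], [[100.0, 0.0], [0.0, 1.0]], [10.0, 1.0], 2.0)
print(p)
```

For this data the result agrees with what the explicit formulas (4.60) and (4.61) give, which is the point of the rescaling: no separate elliptical code path is needed.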

TRUST  REGIONS  IN  OTHER  NORMS 

Trust regions may also be defined in terms of norms other than the Euclidean norm. For instance, we may have

$$\|p\|_1 \le \Delta_k \quad \text{or} \quad \|p\|_\infty \le \Delta_k,$$

or their scaled counterparts

$$\|Dp\|_1 \le \Delta_k \quad \text{or} \quad \|Dp\|_\infty \le \Delta_k,$$

where $D$ is a positive diagonal matrix as before. Norms such as these offer no obvious advantages for small to medium unconstrained problems, but they may be useful for constrained problems. For instance, for the bound-constrained problem

$$\min_{x \in \mathbb{R}^n} f(x), \quad \text{subject to } x \ge 0,$$

the trust-region subproblem may take the form

$$\min_{p \in \mathbb{R}^n} \; m_k(p) = f_k + g_k^T p + \tfrac12 p^T B_k p \quad \text{s.t. } x_k + p \ge 0, \; \|p\| \le \Delta_k. \qquad (4.62)$$

When the trust region is defined by a Euclidean norm, the feasible region for (4.62) consists of the intersection of a sphere and the nonnegative orthant, which is an awkward object, geometrically speaking. When the $\infty$-norm is used, however, the feasible region is simply the rectangular box defined by

$$x_k + p \ge 0, \quad p \ge -\Delta_k e, \quad p \le \Delta_k e,$$

where $e = (1, 1, \dots, 1)^T$, so the solution of the subproblem is easily calculated by using techniques for bound-constrained quadratic programming.
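The three inequalities above collapse into one lower and one upper bound per component of $p$. A minimal sketch, assuming the current iterate is feasible ($x_k \ge 0$); the function name `box_bounds` is hypothetical:

```python
def box_bounds(x, delta):
    """Componentwise bounds on p from x + p >= 0 and ||p||_inf <= delta."""
    lower = [max(-xi, -delta) for xi in x]   # p_i >= -x_i and p_i >= -delta
    upper = [delta for _ in x]               # p_i <= delta
    return lower, upper

lower, upper = box_bounds([0.3, 2.0, 0.0], delta=1.0)
print(lower, upper)
```

Components of $x_k$ that are closer to the bound than $\Delta_k$ tighten the lower limit; the others are limited only by the trust-region radius.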

For large problems, in which factorization or formation of the model Hessian $B_k$ is not computationally desirable, the use of a trust region defined by $\|\cdot\|_\infty$ will also give rise to a bound-constrained subproblem, which may be more convenient to solve than the standard subproblem (4.3). To our knowledge, there has not been much research on the relative performance of methods that use trust regions of different shapes on large problems.


NOTES AND REFERENCES

One of the earliest works on trust-region methods is Winfield [307]. The influential paper of Powell [244] proves a result like Theorem 4.5 for the case of $\eta = 0$, where the algorithm takes a step whenever it decreases the function value. Powell uses a weaker assumption than ours on the matrices $\|B_k\|$, but his analysis is more complicated. Moré [211] summarizes developments in algorithms and software before 1982, paying particular attention to the importance of using a scaled trust-region norm.

Byrd, Schnabel, and Schultz [279], [54] provide a general theory for inexact trust-region methods; they introduce the idea of two-dimensional subspace minimization and also focus on proper handling of the case of indefinite $B$ to ensure stronger local convergence results than Theorems 4.5 and 4.6. Dennis and Schnabel [93] survey trust-region methods as part of their overview of unconstrained optimization, providing pointers to many important developments in the literature.

The monograph of Conn, Gould, and Toint [74] is an exhaustive treatment of the state of the art in trust-region methods for both unconstrained and constrained optimization. It includes a comprehensive annotated bibliography of the literature in the area.


EXERCISES

4.1 Let $f(x) = 10(x_2 - x_1^2)^2 + (1 - x_1)^2$. At $x = (0, -1)$ draw the contour lines of the quadratic model (4.2) assuming that $B$ is the Hessian of $f$. Draw the family of solutions of (4.3) as the trust-region radius varies from $\Delta = 0$ to $\Delta = 2$. Repeat this at $x = (0, 0.5)$.

4.2 Write a program that implements the dogleg method. Choose $B_k$ to be the exact Hessian. Apply it to solve Rosenbrock's function (2.22). Experiment with the update rule for the trust region by changing the constants in Algorithm 4.1, or by designing your own rules.

4.3 Program the trust-region method based on Algorithm 7.2. Choose $B_k$ to be the exact Hessian, and use it to minimize the function

$$\min f(x) = \sum_{i=1}^{n} \left[ (1 - x_{2i-1})^2 + 10 (x_{2i} - x_{2i-1}^2)^2 \right]$$

with $n = 10$. Experiment with the starting point and the stopping test for the CG iteration. Repeat the computation with $n = 50$.

Your program should indicate, at every iteration, whether Algorithm 7.2 encountered negative curvature, reached the trust-region boundary, or met the stopping test.

4.4 Theorem 4.5 shows that the sequence $\{\|g_k\|\}$ has an accumulation point at zero. Show that if the iterates $x_k$ stay in a bounded set $B$, then there is a limit point $x_\infty$ of the sequence $\{x_k\}$ such that $g(x_\infty) = 0$.

4.5 Show that $\tau_k$ defined by (4.12) does indeed identify the minimizer of $m_k$ along the direction $-g_k$.

4.6 The Cauchy-Schwarz inequality states that for any vectors $u$ and $v$, we have

$$|u^T v|^2 \le (u^T u)(v^T v),$$

with equality only when $u$ and $v$ are parallel. When $B$ is positive definite, use this inequality to show that

$$\gamma \stackrel{\rm def}{=} \frac{\|g\|^4}{(g^T B g)(g^T B^{-1} g)} \le 1,$$

with equality only if $g$ and $Bg$ (and $B^{-1} g$) are parallel.

4.7 When $B$ is positive definite, the double-dogleg method constructs a path with three line segments from the origin to the full step. The four points that define the path are

•  the origin;

•  the unconstrained Cauchy step $p^C = -\dfrac{g^T g}{g^T B g}\, g$;

•  a fraction of the full step $\bar\gamma p^B = -\bar\gamma B^{-1} g$, for some $\bar\gamma \in (\gamma, 1]$, where $\gamma$ is defined in the previous question; and

•  the full step $p^B = -B^{-1} g$.

Show that $\|p\|$ increases monotonically along this path.

(Note: The double-dogleg method, as discussed in Dennis and Schnabel [92, Section 6.4.2], was for some time thought to be superior to the standard dogleg method, but later testing has not shown much difference in performance.)

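4.8 Show that (4.43) and (4.44) are equivalent. Hints: Note that

$$\frac{d}{d\lambda} \left( \frac{1}{\|p(\lambda)\|} \right) = \frac{d}{d\lambda} \left( \|p(\lambda)\|^2 \right)^{-1/2} = -\frac12 \left( \|p(\lambda)\|^2 \right)^{-3/2} \frac{d}{d\lambda} \|p(\lambda)\|^2,$$

$$\frac{d}{d\lambda} \|p(\lambda)\|^2 = -2 \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^3}$$

(from (4.39)), and

$$\|q\|^2 = \|R^{-T} p\|^2 = p^T (B + \lambda I)^{-1} p = \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^3}.$$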

4.9 Derive the solution of the two-dimensional subspace minimization problem in the case where $B$ is positive definite.

4.10 Show that if $B$ is any symmetric matrix, then there exists $\lambda \ge 0$ such that $B + \lambda I$ is positive definite.

4.11 Verify that the definitions (4.60) for $p_k^s$ and (4.61) for $\tau_k$ are valid for the Cauchy point in the case of an elliptical trust region. (Hint: Using the theory of Chapter 12, we can show that the solution of (4.58) satisfies $g_k + \alpha D^2 p_k^s = 0$ for some scalar $\alpha \ge 0$.)

4.12  The  following  example  shows  that  the  reduction  in  the  model  function  m 
achieved  by  the  two-dimensional  minimization  strategy  can  be  much  smaller  than  that 
achieved  by  the  exact  solution  of  (4.5). 

In  (4.5),  set 


where  e  is  a  small  positive  number.  Set 

B  =  diag  1,  e3^  ,  A  =  0.5. 

Show  that  the  solution  of  (4.5)  has  components  (O(e),  |  +  O(e),  0(e))T  and  that  the 
reduction  in  the  model  m  is  |  +  0(e).  For  the  two-dimensional  minimization  strategy, 
show  that  the  solution  is  a  multiple  of  B~lg  and  that  the  reduction  in  m  is  0(e). 


Chapter 5

Conjugate Gradient Methods


Our  interest  in  conjugate  gradient  methods  is  twofold.  First,  they  are  among  the  most  useful 
techniques  for  solving  large  linear  systems  of  equations.  Second,  they  can  be  adapted  to  solve 
nonlinear  optimization  problems.  The  remarkable  properties  of  both  linear  and  nonlinear 
conjugate  gradient  methods  will  be  described  in  this  chapter. 

The  linear  conjugate  gradient  method  was  proposed  by  Hestenes  and  Stiefel  in  the 
1950s  as  an  iterative  method  for  solving  linear  systems  with  positive  definite  coefficient 
matrices.  It  is  an  alternative  to  Gaussian  elimination  that  is  well  suited  for  solving  large 
problems. The performance of the linear conjugate gradient method is determined by the
distribution of the eigenvalues of the coefficient matrix. By transforming, or preconditioning,
the  linear  system,  we  can  make  this  distribution  more  favorable  and  improve  the  convergence 
of  the  method  significantly.  Preconditioning  plays  a  crucial  role  in  the  design  of  practical 
conjugate  gradient  strategies.  Our  treatment  of  the  linear  conjugate  gradient  method  will 
highlight  those  properties  of  the  method  that  are  important  in  optimization. 

The  first  nonlinear  conjugate  gradient  method  was  introduced  by  Fletcher  and  Reeves 
in  the  1960s.  It  is  one  of  the  earliest  known  techniques  for  solving  large-scale  nonlinear 
optimization  problems.  Over  the  years,  many  variants  of  this  original  scheme  have  been 
proposed,  and  some  are  widely  used  in  practice.  The  key  features  of  these  algorithms  are 
that  they  require  no  matrix  storage  and  are  faster  than  the  steepest  descent  method. 


5.1  THE  LINEAR  CONJUGATE  GRADIENT  METHOD 


In  this  section  we  derive  the  linear  conjugate  gradient  method  and  discuss  its  essential 
convergence  properties.  For  simplicity,  we  drop  the  qualifier  “linear”  throughout. 

The  conjugate  gradient  method  is  an  iterative  method  for  solving  a  linear  system  of 
equations 


Ax  =  b,  (5.1) 

where $A$ is an $n \times n$ symmetric positive definite matrix. The problem (5.1) can be stated equivalently as the following minimization problem:

$$\min \; \phi(x) \stackrel{\rm def}{=} \tfrac12 x^T A x - b^T x, \qquad (5.2)$$

that is, both (5.1) and (5.2) have the same unique solution. This equivalence will allow us to interpret the conjugate gradient method either as an algorithm for solving linear systems or as a technique for minimizing convex quadratic functions. For future reference, we note that the gradient of $\phi$ equals the residual of the linear system, that is,

$$\nabla \phi(x) = Ax - b \stackrel{\rm def}{=} r(x), \qquad (5.3)$$

so in particular at $x = x_k$ we have

$$r_k = A x_k - b. \qquad (5.4)$$

CONJUGATE  DIRECTION  METHODS 

One of the remarkable properties of the conjugate gradient method is its ability to generate, in a very economical fashion, a set of vectors with a property known as conjugacy. A set of nonzero vectors $\{p_0, p_1, \dots, p_l\}$ is said to be conjugate with respect to the symmetric positive definite matrix $A$ if

$$p_i^T A p_j = 0, \quad \text{for all } i \ne j. \qquad (5.5)$$

It is easy to show that any set of vectors satisfying this property is also linearly independent. (For a geometrical illustration of conjugate directions see Section 9.4.)

The importance of conjugacy lies in the fact that we can minimize $\phi(\cdot)$ in $n$ steps by successively minimizing it along the individual directions in a conjugate set. To verify this claim, we consider the following conjugate direction method. (The distinction between the conjugate gradient method and the conjugate direction method will become clear as we proceed.) Given a starting point $x_0 \in \mathbb{R}^n$ and a set of conjugate directions $\{p_0, p_1, \dots, p_{n-1}\}$, let us generate the sequence $\{x_k\}$ by setting

$$x_{k+1} = x_k + \alpha_k p_k, \qquad (5.6)$$

where $\alpha_k$ is the one-dimensional minimizer of the quadratic function $\phi(\cdot)$ along $x_k + \alpha p_k$, given explicitly by

$$\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}; \qquad (5.7)$$

see (3.55). We have the following result.

Theorem  5.1. 

For any $x_0 \in \mathbb{R}^n$ the sequence $\{x_k\}$ generated by the conjugate direction algorithm (5.6), (5.7) converges to the solution $x^*$ of the linear system (5.1) in at most $n$ steps.

Proof. Since the directions $\{p_i\}$ are linearly independent, they must span the whole space $\mathbb{R}^n$. Hence, we can write the difference between $x_0$ and the solution $x^*$ in the following way:

$$x^* - x_0 = \sigma_0 p_0 + \sigma_1 p_1 + \cdots + \sigma_{n-1} p_{n-1}$$

for some choice of scalars $\sigma_k$. By premultiplying this expression by $p_k^T A$ and using the conjugacy property (5.5), we obtain

$$\sigma_k = \frac{p_k^T A (x^* - x_0)}{p_k^T A p_k}. \qquad (5.8)$$

We now establish the result by showing that these coefficients $\sigma_k$ coincide with the step lengths $\alpha_k$ generated by the formula (5.7).

If $x_k$ is generated by algorithm (5.6), (5.7), then we have

$$x_k = x_0 + \alpha_0 p_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1}.$$

By premultiplying this expression by $p_k^T A$ and using the conjugacy property, we have that

$$p_k^T A (x_k - x_0) = 0,$$

and therefore

$$p_k^T A (x^* - x_0) = p_k^T A (x^* - x_k) = p_k^T (b - A x_k) = -p_k^T r_k.$$

By comparing this relation with (5.7) and (5.8), we find that $\sigma_k = \alpha_k$, giving the result. □
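The $n$-step termination can be observed on a small example. The sketch below applies the updates (5.6), (5.7) to a $2 \times 2$ system with a hand-picked conjugate pair; the matrix, right-hand side, starting point, and helper names are illustrative assumptions rather than data from the text.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_direction_solve(A, b, x0, directions):
    """Apply (5.6)-(5.7) once along each direction of a conjugate set."""
    x = list(x0)
    for p in directions:
        r = [ax - bi for ax, bi in zip(matvec(A, x), b)]   # r_k = A x_k - b
        alpha = -dot(r, p) / dot(p, matvec(A, p))          # step length (5.7)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # update (5.6)
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
# p0^T A p1 = 4*(-0.25) + 1*1 = 0, so the pair is conjugate with respect to A.
directions = [[1.0, 0.0], [-0.25, 1.0]]
x = conjugate_direction_solve(A, b, [2.0, 1.0], directions)
print(x)
```

After two steps the iterate coincides (up to rounding) with the exact solution $x^* = (1/11, 7/11)^T$ of this system.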

There is a simple interpretation of the properties of conjugate directions. If the matrix $A$ in (5.2) is diagonal, the contours of the function $\phi(\cdot)$ are ellipses whose axes are aligned with the coordinate directions, as illustrated in Figure 5.1. We can find the minimizer of this function by performing one-dimensional minimizations along the coordinate directions $e_1, e_2, \dots, e_n$ in turn.

Figure 5.1 Successive minimizations along the coordinate directions find the minimizer of a quadratic with a diagonal Hessian in $n$ iterations.

When $A$ is not diagonal, its contours are still elliptical, but they are usually no longer aligned with the coordinate directions. The strategy of successive minimization along these directions in turn no longer leads to the solution in $n$ iterations (or even in a finite number of iterations). This phenomenon is illustrated in the two-dimensional example of Figure 5.2.

Figure 5.2 Successive minimization along coordinate axes does not find the solution in $n$ iterations, for a general convex quadratic.

We can, however, recover the nice behavior of Figure 5.1 if we transform the problem to make $A$ diagonal and then minimize along the coordinate directions. Suppose we transform the problem by defining new variables $\hat x$ as

$$\hat x = S^{-1} x, \qquad (5.9)$$

where $S$ is the $n \times n$ matrix defined by

$$S = [p_0 \; p_1 \; \cdots \; p_{n-1}],$$

where $\{p_0, p_1, \dots, p_{n-1}\}$ is the set of conjugate directions with respect to $A$. The quadratic $\phi$ defined by (5.2) now becomes

$$\hat\phi(\hat x) \stackrel{\rm def}{=} \phi(S \hat x) = \tfrac12 \hat x^T (S^T A S) \hat x - (S^T b)^T \hat x.$$

By the conjugacy property (5.5), the matrix $S^T A S$ is diagonal, so we can find the minimizing value of $\hat\phi$ by performing $n$ one-dimensional minimizations along the coordinate directions of $\hat x$. Because of the relation (5.9), however, the $i$th coordinate direction in $\hat x$-space corresponds to the direction $p_i$ in $x$-space. Hence, the coordinate search strategy applied to $\hat\phi$ is equivalent to the conjugate direction algorithm (5.6), (5.7). We conclude, as in Theorem 5.1, that the conjugate direction algorithm terminates in at most $n$ steps.

Returning to Figure 5.1, we note another interesting property: When the Hessian matrix is diagonal, each coordinate minimization correctly determines one of the components of the solution $x^*$. In other words, after $k$ one-dimensional minimizations, the quadratic has been minimized on the subspace spanned by $e_1, e_2, \dots, e_k$. The following theorem proves this important result for the general case in which the Hessian of the quadratic is not necessarily diagonal. (Here and later, we use the notation $\mathrm{span}\{p_0, p_1, \dots, p_k\}$ to denote the set of all linear combinations of the vectors $p_0, p_1, \dots, p_k$.) In proving the result we will make use of the following expression, which is easily verified from the relations (5.4) and (5.6):

$$r_{k+1} = r_k + \alpha_k A p_k. \qquad (5.10)$$

Theorem 5.2 (Expanding Subspace Minimization).

Let $x_0 \in \mathbb{R}^n$ be any starting point and suppose that the sequence $\{x_k\}$ is generated by the conjugate direction algorithm (5.6), (5.7). Then

$$r_k^T p_i = 0, \quad \text{for } i = 0, 1, \dots, k-1, \qquad (5.11)$$

and $x_k$ is the minimizer of $\phi(x) = \tfrac12 x^T A x - b^T x$ over the set

$$\{x \mid x = x_0 + \mathrm{span}\{p_0, p_1, \dots, p_{k-1}\}\}. \qquad (5.12)$$

Proof. We begin by showing that a point $\tilde x$ minimizes $\phi$ over the set (5.12) if and only if $r(\tilde x)^T p_i = 0$, for each $i = 0, 1, \dots, k-1$. Let us define $h(\sigma) = \phi(x_0 + \sigma_0 p_0 + \cdots + \sigma_{k-1} p_{k-1})$, where $\sigma = (\sigma_0, \sigma_1, \dots, \sigma_{k-1})^T$. Since $h(\sigma)$ is a strictly convex quadratic, it has a unique minimizer $\sigma^*$ that satisfies

$$\frac{\partial h(\sigma^*)}{\partial \sigma_i} = 0, \quad i = 0, 1, \dots, k-1.$$

By the chain rule, this equation implies that

$$\nabla \phi(x_0 + \sigma_0^* p_0 + \cdots + \sigma_{k-1}^* p_{k-1})^T p_i = 0, \quad i = 0, 1, \dots, k-1.$$

By recalling the definition (5.3), we have for the minimizer $\tilde x = x_0 + \sigma_0^* p_0 + \sigma_1^* p_1 + \cdots + \sigma_{k-1}^* p_{k-1}$ on the set (5.12) that $r(\tilde x)^T p_i = 0$, as claimed.

We now use induction to show that $x_k$ satisfies (5.11). For the case $k = 1$, we have from the fact that $x_1 = x_0 + \alpha_0 p_0$ minimizes $\phi$ along $p_0$ that $r_1^T p_0 = 0$. Let us now make the induction hypothesis, namely, that $r_{k-1}^T p_i = 0$ for $i = 0, 1, \dots, k-2$. By (5.10), we have

$$r_k = r_{k-1} + \alpha_{k-1} A p_{k-1},$$

so that

$$p_{k-1}^T r_k = p_{k-1}^T r_{k-1} + \alpha_{k-1} p_{k-1}^T A p_{k-1} = 0,$$

by the definition (5.7) of $\alpha_{k-1}$. Meanwhile, for the other vectors $p_i$, $i = 0, 1, \dots, k-2$, we have

$$p_i^T r_k = p_i^T r_{k-1} + \alpha_{k-1} p_i^T A p_{k-1} = 0,$$

where $p_i^T r_{k-1} = 0$ because of the induction hypothesis and $p_i^T A p_{k-1} = 0$ because of conjugacy of the vectors $p_i$. We have shown that $r_k^T p_i = 0$, for $i = 0, 1, \dots, k-1$, so the proof is complete. □

The fact that the current residual $r_k$ is orthogonal to all previous search directions, as expressed in (5.11), is a property that will be used extensively in this chapter.
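The orthogonality property (5.11) can be checked exactly with rational arithmetic. In the sketch below, the $3 \times 3$ matrix, the right-hand side, and the hand-computed conjugate directions are assumptions chosen for illustration; `fractions.Fraction` from the Python standard library keeps every inner product exact.

```python
from fractions import Fraction as F

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
b = [F(1), F(0), F(1)]
# A set of directions that is conjugate with respect to A.
directions = [[F(1), F(0), F(0)],
              [F(-1, 2), F(1), F(0)],
              [F(1, 3), F(-2, 3), F(1)]]

x = [F(0), F(0), F(0)]
inner_products = []   # every r_k^T p_i with i < k
for k, p in enumerate(directions):
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]        # r_k = A x_k - b
    inner_products += [dot(r, directions[i]) for i in range(k)]
    alpha = -dot(r, p) / dot(p, matvec(A, p))               # (5.7)
    x = [xi + alpha * pi for xi, pi in zip(x, p)]           # (5.6)

r_final = [ax - bi for ax, bi in zip(matvec(A, x), b)]
inner_products += [dot(r_final, q) for q in directions]
print(x, inner_products)
```

Every recorded inner product is exactly zero, and the final iterate solves $Ax = b$ after three steps, in line with Theorems 5.1 and 5.2.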

The discussion so far has been general, in that it applies to a conjugate direction method (5.6), (5.7) based on any choice of the conjugate direction set $\{p_0, p_1, \dots, p_{n-1}\}$. There are many ways to choose the set of conjugate directions. For instance, the eigenvectors $v_1, v_2, \dots, v_n$ of $A$ are mutually orthogonal as well as conjugate with respect to $A$, so these could be used as the vectors $\{p_0, p_1, \dots, p_{n-1}\}$. For large-scale applications, however, computation of the complete set of eigenvectors requires an excessive amount of computation. An alternative approach is to modify the Gram-Schmidt orthogonalization process to produce a set of conjugate directions rather than a set of orthogonal directions. (This modification is easy to produce, since the properties of conjugacy and orthogonality are closely related in spirit.) However, the Gram-Schmidt approach is also expensive, since it requires us to store the entire direction set.
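One way to read the Gram-Schmidt modification is to replace the Euclidean inner product $u^T v$ by the $A$-inner product $u^T A v$. The following sketch is one such variant, written under that assumption; the function name `conjugate_gram_schmidt` and the test matrix are hypothetical. Note how the list `directions` holds every earlier vector, which is exactly the storage cost mentioned above.

```python
from fractions import Fraction as F

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gram_schmidt(A, vectors):
    """Turn linearly independent vectors into an A-conjugate set."""
    directions = []
    for v in vectors:
        p = list(v)
        for q in directions:                  # needs ALL earlier directions
            Aq = matvec(A, q)
            coeff = dot(v, Aq) / dot(q, Aq)   # A-inner-product projection
            p = [pi - coeff * qi for pi, qi in zip(p, q)]
        directions.append(p)
    return directions

A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
basis = [[F(1), F(0), F(0)], [F(0), F(1), F(0)], [F(0), F(0), F(1)]]
P = conjugate_gram_schmidt(A, basis)
print(P)
```

Starting from the coordinate vectors, the procedure returns three directions whose pairwise products $p_i^T A p_j$ vanish exactly.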

BASIC  PROPERTIES  OF  THE  CONJUGATE  GRADIENT  METHOD 

The conjugate gradient method is a conjugate direction method with a very special property: In generating its set of conjugate vectors, it can compute a new vector $p_k$ by using only the previous vector $p_{k-1}$. It does not need to know all the previous elements $p_0, p_1, \dots, p_{k-2}$ of the conjugate set; $p_k$ is automatically conjugate to these vectors. This remarkable property implies that the method requires little storage and computation.

In the conjugate gradient method, each direction $p_k$ is chosen to be a linear combination of the negative residual $-r_k$ (which, by (5.3), is the steepest descent direction for the function $\phi$) and the previous direction $p_{k-1}$. We write

$$p_k = -r_k + \beta_k p_{k-1}, \qquad (5.13)$$

where the scalar $\beta_k$ is to be determined by the requirement that $p_{k-1}$ and $p_k$ must be conjugate with respect to $A$. By premultiplying (5.13) by $p_{k-1}^T A$ and imposing the condition $p_{k-1}^T A p_k = 0$, we find that

$$\beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}.$$

We choose the first search direction $p_0$ to be the steepest descent direction at the initial point $x_0$. As in the general conjugate direction method, we perform successive one-dimensional minimizations along each of the search directions. We have thus specified a complete algorithm, which we express formally as follows:

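Algorithm 5.1 (CG: Preliminary Version).

Given $x_0$;

Set $r_0 \leftarrow A x_0 - b$, $p_0 \leftarrow -r_0$, $k \leftarrow 0$;

while $r_k \ne 0$

    $\alpha_k \leftarrow -\dfrac{r_k^T p_k}{p_k^T A p_k}$;    (5.14a)

    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;    (5.14b)

    $r_{k+1} \leftarrow A x_{k+1} - b$;    (5.14c)

    $\beta_{k+1} \leftarrow \dfrac{r_{k+1}^T A p_k}{p_k^T A p_k}$;    (5.14d)

    $p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k$;    (5.14e)

    $k \leftarrow k + 1$;    (5.14f)

end (while)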

This version is useful for studying the essential properties of the conjugate gradient method, but we present a more efficient version later. We show first that the directions $p_0, p_1, \dots, p_{n-1}$ are indeed conjugate, which by Theorem 5.1 implies termination in $n$ steps. The theorem below establishes this property and two other important properties. First, the residuals $r_i$ are mutually orthogonal. Second, each search direction $p_k$ and residual $r_k$ is contained in the Krylov subspace of degree $k$ for $r_0$, defined as

$$\mathcal{K}(r_0; k) \stackrel{\rm def}{=} \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\}. \qquad (5.15)$$

Theorem 5.3.

Suppose that the $k$th iterate generated by the conjugate gradient method is not the solution point $x^*$. The following four properties hold:

$$r_k^T r_i = 0, \quad \text{for } i = 0, 1, \dots, k-1, \qquad (5.16)$$

$$\mathrm{span}\{r_0, r_1, \dots, r_k\} = \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\}, \qquad (5.17)$$

$$\mathrm{span}\{p_0, p_1, \dots, p_k\} = \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\}, \qquad (5.18)$$

$$p_k^T A p_i = 0, \quad \text{for } i = 0, 1, \dots, k-1. \qquad (5.19)$$

Therefore, the sequence $\{x_k\}$ converges to $x^*$ in at most $n$ steps.
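Properties (5.16) and (5.19) can be confirmed on a small instance by running the preliminary CG iteration in exact rational arithmetic, where the loop test $r_k \ne 0$ is meaningful. The matrix, right-hand side, and variable names below are assumptions chosen for illustration.

```python
from fractions import Fraction as F

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
b = [F(1), F(0), F(0)]
x = [F(0), F(0), F(0)]
r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
p = [-ri for ri in r]
residuals, directions = [r], [p]

while any(ri != 0 for ri in r):               # exact test r_k != 0
    Ap = matvec(A, p)
    pAp = dot(p, Ap)
    alpha = -dot(r, p) / pAp
    x = [xi + alpha * pi for xi, pi in zip(x, p)]
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
    beta = dot(r, Ap) / pAp
    p = [-ri + beta * pi for ri, pi in zip(r, p)]
    residuals.append(r)
    directions.append(p)

steps = len(residuals) - 1
# (5.16): residuals are mutually orthogonal.
orthogonal = all(dot(residuals[i], residuals[j]) == 0
                 for j in range(len(residuals)) for i in range(j))
# (5.19): the search directions actually used are mutually conjugate.
conjugate = all(dot(directions[i], matvec(A, directions[j])) == 0
                for j in range(steps) for i in range(j))
print(steps, x, orthogonal, conjugate)
```

For this system the iteration stops after exactly $n = 3$ steps, with all residual and conjugacy products equal to zero.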

Proof. The proof is by induction. The expressions (5.17) and (5.18) hold trivially for $k = 0$,
while (5.19) holds by construction for $k = 1$. Assuming now that these three expressions are
true for some $k$ (the induction hypothesis), we show that they continue to hold for $k + 1$.

To  prove  (5.17),  we  show  first  that  the  set  on  the  left-hand  side  is  contained  in  the  set 
on  the  right-hand  side.  Because  of  the  induction  hypothesis,  we  have  from  (5.17)  and  (5.18) 
that 


$$r_k \in \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0\}, \qquad p_k \in \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0\},$$

while by multiplying the second of these expressions by $A$, we obtain

$$Ap_k \in \mathrm{span}\{Ar_0, \dots, A^{k+1} r_0\}. \tag{5.20}$$

By applying (5.10), we find that

$$r_{k+1} \in \mathrm{span}\{r_0, Ar_0, \dots, A^{k+1} r_0\}.$$


By  combining  this  expression  with  the  induction  hypothesis  for  (5.17),  we  conclude  that 


$$\mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\} \subset \mathrm{span}\{r_0, Ar_0, \dots, A^{k+1} r_0\}.$$


To  prove  that  the  reverse  inclusion  holds  as  well,  we  use  the  induction  hypothesis  on  (5.18) 
to  deduce  that 


$$A^{k+1} r_0 = A(A^k r_0) \in \mathrm{span}\{Ap_0, Ap_1, \dots, Ap_k\}.$$

Since by (5.10) we have $Ap_i = (r_{i+1} - r_i)/\alpha_i$ for $i = 0, 1, \dots, k$, it follows that

$$A^{k+1} r_0 \in \mathrm{span}\{r_0, r_1, \dots, r_{k+1}\}.$$

By combining this expression with the induction hypothesis for (5.17), we find that

$$\mathrm{span}\{r_0, Ar_0, \dots, A^{k+1} r_0\} \subset \mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\}.$$


Therefore,  the  relation  (5.17)  continues  to  hold  when  k  is  replaced  by  k  +  1,  as  claimed. 

We  show  that  (5.18)  continues  to  hold  when  k  is  replaced  by  k  +  1  by  the  following 
argument: 


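$$\begin{aligned}
&\mathrm{span}\{p_0, p_1, \dots, p_k, p_{k+1}\} \\
&\quad = \mathrm{span}\{p_0, p_1, \dots, p_k, r_{k+1}\} && \text{by (5.14e)} \\
&\quad = \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0, r_{k+1}\} && \text{by induction hypothesis for (5.18)} \\
&\quad = \mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\} && \text{by (5.17)} \\
&\quad = \mathrm{span}\{r_0, Ar_0, \dots, A^{k+1} r_0\} && \text{by (5.17) for } k+1.
\end{aligned}$$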


Next, we prove the conjugacy condition (5.19) with $k$ replaced by $k + 1$. By multiplying
(5.14e) by $Ap_i$, $i = 0, 1, \dots, k$, we obtain

$$p_{k+1}^T A p_i = -r_{k+1}^T A p_i + \beta_{k+1} p_k^T A p_i. \tag{5.21}$$

By the definition (5.14d) of $\beta_{k+1}$, the right-hand side of (5.21) vanishes when $i = k$. For
$i \le k - 1$ we need to collect a number of observations. Note first that our induction
hypothesis for (5.19) implies that the directions $p_0, p_1, \dots, p_k$ are conjugate, so we can
apply Theorem 5.2 to deduce that

$$r_{k+1}^T p_i = 0, \quad \text{for } i = 0, 1, \dots, k. \tag{5.22}$$

Second, by repeatedly applying (5.18), we find that for $i = 0, 1, \dots, k - 1$, the following
inclusion holds:

$$Ap_i \in A\,\mathrm{span}\{r_0, Ar_0, \dots, A^i r_0\} = \mathrm{span}\{Ar_0, A^2 r_0, \dots, A^{i+1} r_0\} \subset \mathrm{span}\{p_0, p_1, \dots, p_{i+1}\}. \tag{5.23}$$

By combining (5.22) and (5.23), we deduce that

$$r_{k+1}^T A p_i = 0, \quad \text{for } i = 0, 1, \dots, k - 1,$$

so the first term in the right-hand side of (5.21) vanishes for $i = 0, 1, \dots, k - 1$. Because
of the induction hypothesis for (5.19), the second term vanishes as well, and we
conclude that $p_{k+1}^T A p_i = 0$, $i = 0, 1, \dots, k$. Hence, the induction argument holds for (5.19)
also.

It  follows  that  the  direction  set  generated  by  the  conjugate  gradient  method  is  indeed 
a  conjugate  direction  set,  so  Theorem  5.1  tells  us  that  the  algorithm  terminates  in  at  most  n 
iterations. 

Finally, we prove (5.16) by a noninductive argument. Because the direction set is
conjugate, we have from (5.11) that $r_k^T p_i = 0$ for all $i = 0, 1, \dots, k - 1$ and any
$k = 1, 2, \dots, n - 1$. By rearranging (5.14e), we find that

$$p_i = -r_i + \beta_i p_{i-1},$$

so that $r_i \in \mathrm{span}\{p_i, p_{i-1}\}$ for all $i = 1, \dots, k - 1$. We conclude that $r_k^T r_i = 0$ for all
$i = 1, \dots, k - 1$. To complete the proof, we note that $r_k^T r_0 = -r_k^T p_0 = 0$, by definition of
$p_0$ in Algorithm 5.1 and by (5.11). □


The proof of this theorem relies on the fact that the first direction $p_0$ is the steepest
descent direction $-r_0$; in fact, the result does not hold for other choices of $p_0$. Since
the gradients $r_k$ are mutually orthogonal, the term “conjugate gradient method” is
actually a misnomer. It is the search directions, not the gradients, that are conjugate with
respect to $A$.
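The orthogonality and conjugacy properties of Theorem 5.3 can be checked numerically on a small example. The following is a minimal sketch of Algorithm 5.1 in plain Python, not a production solver; the function names (`dot`, `matvec`, `cg_preliminary`), the stopping tolerance, the cap of $n$ iterations, and the 3-by-3 test matrix are all assumptions made for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]

def cg_preliminary(A, b, x0, tol=1e-12):
    """Sketch of Algorithm 5.1; also returns all residuals and directions."""
    x = list(x0)
    r = [ai - bi for ai, bi in zip(matvec(A, x), b)]      # r_0 = A x_0 - b
    p = [-ri for ri in r]                                  # p_0 = -r_0
    residuals, directions = [r], [p]
    # Stop when the residual is tiny or after n steps (illustrative safeguards).
    while dot(r, r) ** 0.5 > tol and len(directions) <= len(b):
        Ap = matvec(A, p)
        pAp = dot(p, Ap)
        alpha = -dot(r, p) / pAp                           # (5.14a)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # (5.14b)
        r = [ai - bi for ai, bi in zip(matvec(A, x), b)]   # (5.14c)
        beta = dot(r, Ap) / pAp                            # (5.14d)
        p = [-ri + beta * pi for ri, pi in zip(r, p)]      # (5.14e)
        residuals.append(r)
        directions.append(p)
    return x, residuals, directions

# Hypothetical symmetric positive definite test problem.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, residuals, directions = cg_preliminary(A, b, [0.0, 0.0, 0.0])
n = len(b)
# Largest violations of (5.16) and (5.19) among the first n vectors.
max_rr = max(abs(dot(residuals[i], residuals[j]))
             for i in range(n) for j in range(i))
max_pAp = max(abs(dot(directions[i], matvec(A, directions[j])))
              for i in range(n) for j in range(i))
print(x, max_rr, max_pAp)
```

In floating-point arithmetic the two printed violations are not exactly zero, but for a small well-conditioned system they should be at the level of rounding error.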


A  PRACTICAL  FORM  OF  THE  CONJUGATE  GRADIENT  METHOD 

We can derive a slightly more economical form of the conjugate gradient method by
using the results of Theorems 5.2 and 5.3. First, we can use (5.14e) and (5.11) to replace the
formula (5.14a) for $\alpha_k$ by

$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}.$$

Second, we have from (5.10) that $\alpha_k A p_k = r_{k+1} - r_k$, so by applying (5.14e) and (5.11)
once again we can simplify the formula for $\beta_{k+1}$ to

$$\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.$$

By using these formulae together with (5.10), we obtain the following standard form of the
conjugate gradient method.
Algorithm 5.2 (CG).

  Given $x_0$;
  Set $r_0 \leftarrow Ax_0 - b$, $p_0 \leftarrow -r_0$, $k \leftarrow 0$;
  while $r_k \ne 0$
    $\alpha_k \leftarrow \dfrac{r_k^T r_k}{p_k^T A p_k}$;   (5.24a)
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;   (5.24b)
    $r_{k+1} \leftarrow r_k + \alpha_k A p_k$;   (5.24c)
    $\beta_{k+1} \leftarrow \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$;   (5.24d)
    $p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k$;   (5.24e)
    $k \leftarrow k + 1$;   (5.24f)
  end (while)

At any given point in Algorithm 5.2 we never need to know the vectors $x$, $r$, and
$p$ for more than the last two iterations. Accordingly, implementations of this algorithm
overwrite old values of these vectors to save on storage. The major computational tasks to be
performed at each step are computation of the matrix-vector product $Ap_k$, calculation of
the inner products $p_k^T (Ap_k)$ and $r_{k+1}^T r_{k+1}$, and calculation of three vector sums. The inner
product and vector sum operations can be performed in a small multiple of $n$ floating-point
operations, while the cost of the matrix-vector product is, of course, dependent on the
problem. The CG method is recommended only for large problems; otherwise, Gaussian
elimination or other factorization algorithms such as the singular value decomposition are
to be preferred, since they are less sensitive to rounding errors. For large problems, the CG
method has the advantage that it does not alter the coefficient matrix and (in contrast to
factorization techniques) does not produce fill in the arrays holding the matrix. Another key
property is that the CG method sometimes approaches the solution quickly, as we discuss
next.
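The storage and cost remarks above can be made concrete with a minimal sketch of Algorithm 5.2 in plain Python. It overwrites `x`, `r`, and `p` in each pass and performs one matrix-vector product per iteration. The function name `cg`, the residual tolerance `tol`, and the iteration cap `max_iter` are assumptions added for illustration; in exact arithmetic the loop would simply run until $r_k = 0$.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]

def cg(A, b, x0, tol=1e-10, max_iter=None):
    """Sketch of Algorithm 5.2 for symmetric positive definite A.

    Returns the approximate solution and the number of iterations used.
    """
    x = list(x0)
    r = [ai - bi for ai, bi in zip(matvec(A, x), b)]   # r_0 = A x_0 - b
    p = [-ri for ri in r]                               # p_0 = -r_0
    rr = dot(r, r)
    if max_iter is None:
        max_iter = 10 * len(b)                          # illustrative cap
    k = 0
    while rr ** 0.5 > tol and k < max_iter:
        Ap = matvec(A, p)                  # the only matrix-vector product
        alpha = rr / dot(p, Ap)                         # (5.24a)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # (5.24b)
        r = [ri + alpha * ai for ri, ai in zip(r, Ap)]  # (5.24c)
        rr_new = dot(r, r)
        beta = rr_new / rr                              # (5.24d)
        p = [-ri + beta * pi for ri, pi in zip(r, p)]   # (5.24e)
        rr = rr_new
        k += 1                                          # (5.24f)
    return x, k

# Hypothetical 3-by-3 symmetric positive definite system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations = cg(A, b, [0.0, 0.0, 0.0])
print(x, iterations)
```

Reusing the scalar `rr` for both $\alpha_k$ and $\beta_{k+1}$ means each iteration needs only two inner products besides the product $Ap_k$.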


RATE  OF  CONVERGENCE 

We  have  seen  that  in  exact  arithmetic  the  conjugate  gradient  method  will  terminate  at 
the  solution  in  at  most  n  iterations.  What  is  more  remarkable  is  that  when  the  distribution 
of  the  eigenvalues  of  A  has  certain  favorable  features,  the  algorithm  will  identify  the  solution 
in  many  fewer  than  n  iterations.  To  explain  this  property,  we  begin  by  viewing  the  expanding 
subspace  minimization  property  proved  in  Theorem  5.2  in  a  slightly  different  way,  using  it 
to  show  that  Algorithm  5.2  is  optimal  in  a  certain  important  sense. 
From (5.24b) and (5.18), we have that

$$\begin{aligned}
x_{k+1} &= x_0 + \alpha_0 p_0 + \cdots + \alpha_k p_k \\
&= x_0 + \gamma_0 r_0 + \gamma_1 A r_0 + \cdots + \gamma_k A^k r_0,
\end{aligned} \tag{5.25}$$

for some constants $\gamma_i$. We now define $P_k^*(\cdot)$ to be a polynomial of degree $k$ with coefficients
$\gamma_0, \gamma_1, \dots, \gamma_k$. Like any polynomial, $P_k^*$ can take either a scalar or a square matrix as its
argument. For the matrix argument $A$, we have

$$P_k^*(A) = \gamma_0 I + \gamma_1 A + \cdots + \gamma_k A^k,$$

which allows us to express (5.25) as follows:

$$x_{k+1} = x_0 + P_k^*(A) r_0. \tag{5.26}$$

We now show that among all possible methods whose first $k$ steps are restricted to the
Krylov subspace $\mathcal{K}(r_0; k)$ given by (5.15), Algorithm 5.2 does the best job of minimizing the
distance to the solution after $k$ steps, when this distance is measured by the weighted norm
measure $\|\cdot\|_A$ defined by

$$\|z\|_A^2 = z^T A z. \tag{5.27}$$

(Recall that this norm was used in the analysis of the steepest descent method of Chapter 3.)
Using this norm and the definition of $\phi$ (5.2), and the fact that $x^*$ minimizes $\phi$, it is easy to
show that

$$\tfrac{1}{2}\|x - x^*\|_A^2 = \tfrac{1}{2}(x - x^*)^T A (x - x^*) = \phi(x) - \phi(x^*). \tag{5.28}$$

Theorem 5.2 states that $x_{k+1}$ minimizes $\phi$, and hence $\|x - x^*\|_A^2$, over the set
$x_0 + \mathrm{span}\{p_0, p_1, \dots, p_k\}$, which by (5.18) is the same as $x_0 + \mathrm{span}\{r_0, Ar_0, \dots, A^k r_0\}$. It follows
from (5.26) that the polynomial $P_k^*$ solves the following problem in which the minimum is
taken over the space of all possible polynomials of degree $k$:

$$\min_{P_k} \|x_0 + P_k(A) r_0 - x^*\|_A. \tag{5.29}$$

We  exploit  this  optimality  property  repeatedly  in  the  remainder  of  the  section. 

Since

$$r_0 = Ax_0 - b = Ax_0 - Ax^* = A(x_0 - x^*),$$

we have that

$$x_{k+1} - x^* = x_0 + P_k^*(A) r_0 - x^* = [I + P_k^*(A) A](x_0 - x^*). \tag{5.30}$$
Let $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $A$, and let $v_1, v_2, \dots, v_n$ be the
corresponding orthonormal eigenvectors, so that

$$A = \sum_{i=1}^n \lambda_i v_i v_i^T.$$

Since the eigenvectors span the whole space $\mathbb{R}^n$, we can write

$$x_0 - x^* = \sum_{i=1}^n \xi_i v_i, \tag{5.31}$$

for some coefficients $\xi_i$. It is easy to show that any eigenvector of $A$ is also an eigenvector
of $P_k(A)$ for any polynomial $P_k$. For our particular matrix $A$ and its eigenvalues $\lambda_i$ and
eigenvectors $v_i$, we have

$$P_k(A) v_i = P_k(\lambda_i) v_i, \qquad i = 1, 2, \dots, n.$$


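By substituting (5.31) into (5.30) we have

$$x_{k+1} - x^* = \sum_{i=1}^n [1 + \lambda_i P_k^*(\lambda_i)] \xi_i v_i.$$

By using the fact that $\|z\|_A^2 = z^T A z = \sum_{i=1}^n \lambda_i (v_i^T z)^2$, we have

$$\|x_{k+1} - x^*\|_A^2 = \sum_{i=1}^n \lambda_i [1 + \lambda_i P_k^*(\lambda_i)]^2 \xi_i^2. \tag{5.32}$$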

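Since the polynomial generated by the CG method is optimal with respect to this norm,
we have

$$\|x_{k+1} - x^*\|_A^2 = \min_{P_k} \sum_{i=1}^n \lambda_i [1 + \lambda_i P_k(\lambda_i)]^2 \xi_i^2.$$

By extracting the largest of the terms $[1 + \lambda_i P_k(\lambda_i)]^2$ from this expression, we obtain that

$$\begin{aligned}
\|x_{k+1} - x^*\|_A^2 &\le \min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2 \left( \sum_{j=1}^n \lambda_j \xi_j^2 \right) \\
&= \min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2 \, \|x_0 - x^*\|_A^2,
\end{aligned} \tag{5.33}$$

where we have used the fact that $\|x_0 - x^*\|_A^2 = \sum_{j=1}^n \lambda_j \xi_j^2$.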
The expression (5.33) allows us to quantify the convergence rate of the CG method
by estimating the nonnegative scalar quantity

$$\min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2. \tag{5.34}$$

In  other  words,  we  search  for  a  polynomial  that  makes  this  expression  as  small  as  possible. 
In  some  practical  cases,  we  can  find  this  polynomial  explicitly  and  draw  some  interesting 
conclusions  about  the  properties  of  the  CG  method.  The  following  result  is  an  example. 

Theorem 5.4.

If $A$ has only $r$ distinct eigenvalues, then the CG iteration will terminate at the solution
in at most $r$ iterations.

Proof. Suppose that the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ take on the $r$ distinct values
$\tau_1 < \tau_2 < \cdots < \tau_r$. We define a polynomial $Q_r(\lambda)$ by

$$Q_r(\lambda) = \frac{(-1)^r}{\tau_1 \tau_2 \cdots \tau_r} (\lambda - \tau_1)(\lambda - \tau_2) \cdots (\lambda - \tau_r),$$

and note that $Q_r(\lambda_i) = 0$ for $i = 1, 2, \dots, n$ and $Q_r(0) = 1$. From the latter observation,
we deduce that $Q_r(\lambda) - 1$ is a polynomial of degree $r$ with a root at $\lambda = 0$, so by polynomial
division, the function $\bar{P}_{r-1}$ defined by

$$\bar{P}_{r-1}(\lambda) = (Q_r(\lambda) - 1)/\lambda$$

is a polynomial of degree $r - 1$. By setting $k = r - 1$ in (5.34), we have

$$0 \le \min_{P_{r-1}} \max_{1 \le i \le n} [1 + \lambda_i P_{r-1}(\lambda_i)]^2 \le \max_{1 \le i \le n} [1 + \lambda_i \bar{P}_{r-1}(\lambda_i)]^2 = \max_{1 \le i \le n} Q_r^2(\lambda_i) = 0.$$

Hence, the constant in (5.34) is zero for the value $k = r - 1$, so we have by substituting into
(5.33) that $\|x_r - x^*\|_A^2 = 0$, and therefore $x_r = x^*$, as claimed. □
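Theorem 5.4 is easy to observe numerically. The sketch below runs the CG recurrences of Algorithm 5.2 on a diagonal matrix, so that the eigenvalues are simply the diagonal entries, and counts the iterations needed to drive the residual below a tolerance. The function name `cg_iterations_diagonal`, the restriction to diagonal matrices, the starting point $x_0 = 0$, and the tolerance are assumptions chosen to keep the example short.

```python
def cg_iterations_diagonal(eigs, b, tol=1e-10):
    """Count CG iterations (Algorithm 5.2) for A = diag(eigs), starting at x_0 = 0.

    Only the residual and search direction are tracked, since the iterate
    itself is not needed to count iterations.
    """
    r = [-bi for bi in b]                 # r_0 = A*0 - b
    p = list(b)                           # p_0 = -r_0
    rr = sum(ri * ri for ri in r)
    k = 0
    while rr ** 0.5 > tol and k < 10 * len(b):
        Ap = [lam * pi for lam, pi in zip(eigs, p)]   # diagonal matvec
        alpha = rr / sum(pi * ai for pi, ai in zip(p, Ap))
        r = [ri + alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [-ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        k += 1
    return k

# Six unknowns but only three distinct eigenvalues (1, 4, and 9).
print(cg_iterations_diagonal([1.0, 1.0, 1.0, 4.0, 4.0, 9.0], [1.0] * 6))
# A multiple of the identity has a single distinct eigenvalue.
print(cg_iterations_diagonal([2.0] * 5, [1.0] * 5))
```

The iteration counts reflect the number of distinct eigenvalues rather than the dimension $n$, up to the effect of rounding and the chosen tolerance.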

By  using  similar  reasoning,  Luenberger  [195]  establishes  the  following  estimate,  which 
gives  a  useful  characterization  of  the  behavior  of  the  CG  method. 

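Theorem 5.5.

If $A$ has eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, we have that

$$\|x_{k+1} - x^*\|_A^2 \le \left( \frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1} \right)^2 \|x_0 - x^*\|_A^2. \tag{5.35}$$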



Figure 5.3 Two clusters of eigenvalues (axis marks recoverable from the figure: $0$, $\lambda_{n-m}$, $\lambda_{n-m+1}$).


Without giving details of the proof, we describe how this result is obtained from (5.33). One
selects a polynomial $\bar{P}_k$ of degree $k$ such that the polynomial $Q_{k+1}(\lambda) = 1 + \lambda \bar{P}_k(\lambda)$ has
roots at the $k$ largest eigenvalues $\lambda_n, \lambda_{n-1}, \dots, \lambda_{n-k+1}$, as well as at the midpoint between
$\lambda_1$ and $\lambda_{n-k}$. It can be shown that the maximum value attained by $Q_{k+1}$ on the remaining
eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_{n-k}$ is precisely $(\lambda_{n-k} - \lambda_1)/(\lambda_{n-k} + \lambda_1)$.

We now illustrate how Theorem 5.5 can be used to predict the behavior of the CG
method on specific problems. Suppose we have the situation plotted in Figure 5.3, where
the eigenvalues of $A$ consist of $m$ large values, with the remaining $n - m$ smaller eigenvalues
clustered around 1. If we define $\epsilon = \lambda_{n-m} - \lambda_1$, Theorem 5.5 tells us that after $m + 1$ steps
of the conjugate gradient algorithm, we have

$$\|x_{m+1} - x^*\|_A \approx \epsilon \|x_0 - x^*\|_A.$$

For a small value of $\epsilon$, we conclude that the CG iterates will provide a good estimate of the
solution after only $m + 1$ steps.
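To see the size of the factor in Theorem 5.5 for a clustered spectrum, the following small sketch evaluates $(\lambda_{n-k} - \lambda_1)/(\lambda_{n-k} + \lambda_1)$, the bound on the ratio of $A$-norm errors obtained by taking square roots in (5.35). The function name `theorem_5_5_factor` and the sample eigenvalues are hypothetical.

```python
def theorem_5_5_factor(eigs, k):
    """Bound from (5.35) on ||x_{k+1} - x*||_A / ||x_0 - x*||_A.

    eigs may be given in any order; k must satisfy 0 <= k < n.
    """
    lam = sorted(eigs)
    n = len(lam)
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    lam_1 = lam[0]
    lam_nk = lam[n - k - 1]      # lambda_{n-k} in the text's 1-based indexing
    return (lam_nk - lam_1) / (lam_nk + lam_1)

# Hypothetical spectrum: three eigenvalues near 1 and m = 5 large ones.
eigs = [0.95, 1.0, 1.05, 50.0, 60.0, 70.0, 80.0, 90.0]
print(theorem_5_5_factor(eigs, 0))   # bound after one step
print(theorem_5_5_factor(eigs, 5))   # bound after m + 1 = 6 steps
```

With $k = m$ the factor depends only on the spread of the small cluster, which is why the bound drops sharply once the large eigenvalues have been accounted for.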

Figure 5.4 shows the behavior of CG on a problem of this type, which has five large
eigenvalues with all the smaller eigenvalues clustered between 0.95 and 1.05, and compares
this behavior with that of CG on a problem in which the eigenvalues satisfy some random
distribution. In both cases, we plot the log of $\phi$ after each iteration.

For  the  problem  with  clustered  eigenvalues,  Theorem  5.5  predicts  a  sharp  decrease  in 
the  error  measure  at  iteration  6.  Note,  however,  that  this  decrease  was  achieved  one  iteration 
earlier,  illustrating  the  fact  that  Theorem  5.5  gives  only  an  upper  bound,  and  that  the  rate  of 
convergence  can  be  faster.  By  contrast,  we  observe  in  Figure  5.4  that  for  the  problem  with 
randomly  distributed  eigenvalues  (dashed  line),  the  convergence  rate  is  slower  and  more 
uniform. 

Figure  5.4  illustrates  another  interesting  feature:  After  one  more  iteration  (a  total 
of  seven)  on  the  problem  with  clustered  eigenvalues,  the  error  measure  drops  sharply.  An 
extension  of  the  arguments  leading  to  Theorem  5.4  explains  this  behavior.  It  is  almost 
true  to  say  that  the  matrix  A  has  just  six  distinct  eigenvalues:  the  five  large  eigenvalues 
and  1 .  Then  we  would  expect  the  error  measure  to  be  zero  after  six  iterations.  Because  the 
eigenvalues  near  1  are  slightly  spread  out,  however,  the  error  does  not  become  very  small  until 
iteration  7. 
Figure 5.4 Performance of the conjugate gradient method on (a) a problem in which
five of the eigenvalues are large and the remainder are clustered near 1, and (b) a matrix
with uniformly distributed eigenvalues.


To state this claim more precisely, it is generally true that if the eigenvalues occur in $r$
distinct clusters, the CG iterates will approximately solve the problem in about $r$ steps (see
[136]). This result can be proved by constructing a polynomial $\bar{P}_{r-1}$ such that $(1 + \lambda \bar{P}_{r-1}(\lambda))$
has zeros inside each of the clusters. This polynomial may not vanish at the eigenvalues $\lambda_i$,
$i = 1, 2, \dots, n$, but its value will be small at these points, so the constant defined in (5.34)
will be small for $k \ge r - 1$. We illustrate this behavior in Figure 5.5, which shows the
performance of CG on a matrix of dimension $n = 14$ that has four clusters of eigenvalues:
single eigenvalues at 140 and 120, a cluster of 10 eigenvalues very close to 10, with the
remaining eigenvalues clustered between 0.95 and 1.05. After four iterations, the error has
decreased significantly. After six iterations, the solution is identified to good accuracy.

Another, more approximate, convergence expression for CG is based on the Euclidean
condition number of $A$, which is defined by

$$\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \lambda_n / \lambda_1.$$

It can be shown that

$$\|x_k - x^*\|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k \|x_0 - x^*\|_A. \tag{5.36}$$


Figure 5.5 Performance of the conjugate gradient method on a matrix in which the
eigenvalues occur in four distinct clusters.

The bound (5.36) often gives a large overestimate of the error, but it can be useful in those cases
where the only information we have about $A$ is estimates of the extreme eigenvalues $\lambda_1$
and $\lambda_n$. This bound should be compared with that of the steepest descent method given by
(3.29), which is identical in form but which depends on the condition number $\kappa(A)$, and
not on its square root $\sqrt{\kappa(A)}$.
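The difference between dependence on $\kappa(A)$ and on $\sqrt{\kappa(A)}$ can be made tangible with a short calculation. The sketch below evaluates the right-hand factor of (5.36) and, for comparison, a steepest-descent factor of the form $((\kappa - 1)/(\kappa + 1))^k$. The latter form is an assumption based on the remark that (3.29) is "identical in form"; (3.29) itself is not reproduced in this section, and the function names are hypothetical.

```python
import math

def cg_bound(kappa, k):
    """Factor multiplying ||x_0 - x*||_A in (5.36) after k CG iterations."""
    s = math.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

def steepest_descent_bound(kappa, k):
    """Assumed analogous factor for steepest descent, using kappa itself."""
    return ((kappa - 1.0) / (kappa + 1.0)) ** k

for k in (10, 50, 100):
    print(k, cg_bound(100.0, k), steepest_descent_bound(100.0, k))
```

For a condition number of 100 the CG factor contracts by roughly 9/11 per iteration, while the assumed steepest-descent factor contracts by only 99/101, so the gap between the two bounds widens rapidly with $k$.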

PRECONDITIONING 

We can accelerate the conjugate gradient method by transforming the linear system
to improve the eigenvalue distribution of $A$. The key to this process, which is known as
preconditioning, is a change of variables from $x$ to $\hat{x}$ via a nonsingular matrix $C$, that is,

$$\hat{x} = Cx. \tag{5.37}$$

The quadratic $\phi$ defined by (5.2) is transformed accordingly to

$$\hat{\phi}(\hat{x}) = \tfrac{1}{2} \hat{x}^T (C^{-T} A C^{-1}) \hat{x} - (C^{-T} b)^T \hat{x}. \tag{5.38}$$

If we use Algorithm 5.2 to minimize $\hat{\phi}$ or, equivalently, to solve the linear system

$$(C^{-T} A C^{-1}) \hat{x} = C^{-T} b,$$

then the convergence rate will depend on the eigenvalues of the matrix $C^{-T} A C^{-1}$ rather
than those of $A$. Therefore, we aim to choose $C$ such that the eigenvalues of $C^{-T} A C^{-1}$
are more favorable for the convergence theory discussed above. We can try to choose $C$
such that the condition number of $C^{-T} A C^{-1}$ is much smaller than the original condition
number of $A$, for instance, so that the constant in (5.36) is smaller. We could also try to
choose $C$ such that the eigenvalues of $C^{-T} A C^{-1}$ are clustered, which by the discussion of
the previous section ensures that the number of iterates needed to find a good approximate
solution is not much larger than the number of clusters.

It is not necessary to carry out the transformation (5.37) explicitly. Rather, we can
apply Algorithm 5.2 to the problem (5.38), in terms of the variables $\hat{x}$, and then invert the
transformations to reexpress all the equations in terms of $x$. This process of derivation results
in Algorithm 5.3 (Preconditioned Conjugate Gradient), which we now define. It happens
that Algorithm 5.3 does not make use of $C$ explicitly, but rather the matrix $M = C^T C$,
which is symmetric and positive definite by construction.

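Algorithm 5.3 (Preconditioned CG).

  Given $x_0$, preconditioner $M$;
  Set $r_0 \leftarrow Ax_0 - b$;
  Solve $My_0 = r_0$ for $y_0$;
  Set $p_0 = -y_0$, $k \leftarrow 0$;
  while $r_k \ne 0$
    $\alpha_k \leftarrow \dfrac{r_k^T y_k}{p_k^T A p_k}$;   (5.39a)
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;   (5.39b)
    $r_{k+1} \leftarrow r_k + \alpha_k A p_k$;   (5.39c)
    Solve $My_{k+1} = r_{k+1}$;   (5.39d)
    $\beta_{k+1} \leftarrow \dfrac{r_{k+1}^T y_{k+1}}{r_k^T y_k}$;   (5.39e)
    $p_{k+1} \leftarrow -y_{k+1} + \beta_{k+1} p_k$;   (5.39f)
    $k \leftarrow k + 1$;   (5.39g)
  end (while)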


If we set $M = I$ in Algorithm 5.3, we recover the standard CG method, Algorithm 5.2.
The properties of Algorithm 5.2 generalize to this case in interesting ways. In particular, the
orthogonality property (5.16) of the successive residuals becomes

$$r_i^T M^{-1} r_j = 0 \quad \text{for all } i \ne j. \tag{5.40}$$

In terms of computational effort, the main difference between the preconditioned
and unpreconditioned CG methods is the need to solve systems of the form $My = r$ (step
(5.39d)).
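A minimal sketch of Algorithm 5.3 in plain Python follows. The preconditioner enters only through a user-supplied function `solve_M` that returns the solution $y$ of $My = r$. The diagonal choice $M = \mathrm{diag}(A)$ used in the example is an assumption made for illustration (it is not one of the preconditioners discussed in this section), as are the function names, the tolerance, and the iteration cap.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]

def pcg(A, b, x0, solve_M, tol=1e-10, max_iter=None):
    """Sketch of Algorithm 5.3; solve_M(r) must return y with M y = r."""
    x = list(x0)
    r = [ai - bi for ai, bi in zip(matvec(A, x), b)]   # r_0 = A x_0 - b
    y = solve_M(r)                                      # M y_0 = r_0
    p = [-yi for yi in y]                               # p_0 = -y_0
    ry = dot(r, y)
    if max_iter is None:
        max_iter = 10 * len(b)                          # illustrative cap
    k = 0
    while dot(r, r) ** 0.5 > tol and k < max_iter:
        Ap = matvec(A, p)
        alpha = ry / dot(p, Ap)                         # (5.39a)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # (5.39b)
        r = [ri + alpha * ai for ri, ai in zip(r, Ap)]  # (5.39c)
        y = solve_M(r)                                  # (5.39d)
        ry_new = dot(r, y)
        beta = ry_new / ry                              # (5.39e)
        p = [-yi + beta * pi for yi, pi in zip(y, p)]   # (5.39f)
        ry = ry_new
        k += 1                                          # (5.39g)
    return x, k

# Hypothetical example with a diagonal preconditioner M = diag(A).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
diag_solve = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
print(pcg(A, b, [0.0, 0.0, 0.0], diag_solve))
# Passing the identity as solve_M recovers the standard CG iteration.
print(pcg(A, b, [0.0, 0.0, 0.0], lambda r: list(r)))
```

Keeping the preconditioner behind a solve function mirrors the observation that only systems $My = r$ are ever needed, never $M$ or $C$ explicitly.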


PRACTICAL  PRECONDITIONERS 

No single preconditioning strategy is “best” for all conceivable types of matrices:
The tradeoff between various objectives (effectiveness of $M$, inexpensive computation and
storage of $M$, inexpensive solution of $My = r$) varies from problem to problem.

Good preconditioning strategies have been devised for specific types of matrices, in
particular, those arising from discretizations of partial differential equations (PDEs). Often,
the preconditioner is defined in such a way that the system $My = r$ amounts to a simplified
version of the original system $Ax = b$. In the case of a PDE, $My = r$ could represent
a coarser discretization of the underlying continuous problem than $Ax = b$. As in many
other areas of optimization and numerical analysis, knowledge about the structure and
origin of a problem (in this case, knowledge that the system $Ax = b$ is a finite-dimensional
representation of a PDE) is the key to devising effective techniques for solving the problem.

General-purpose preconditioners have also been proposed, but their success varies
greatly from problem to problem. The most important strategies of this type include symmetric
successive overrelaxation (SSOR), incomplete Cholesky, and banded preconditioners.
(See [272], [136], and [72] for discussions of these techniques.) Incomplete Cholesky is probably
the most effective in general. The basic idea is simple: We follow the Cholesky procedure,
but instead of computing the exact Cholesky factor $L$ that satisfies $A = LL^T$, we compute
an approximate factor $\tilde{L}$ that is sparser than $L$. (Usually, we require $\tilde{L}$ to be no denser, or
not much denser, than the lower triangle of the original matrix $A$.) We then have $A \approx \tilde{L}\tilde{L}^T$,
and by choosing $C = \tilde{L}^T$, we obtain $M = \tilde{L}\tilde{L}^T$ and

$$C^{-T} A C^{-1} = \tilde{L}^{-1} A \tilde{L}^{-T} \approx I,$$

so the eigenvalue distribution of $C^{-T} A C^{-1}$ is favorable. We do not compute $M$ explicitly,
but rather store the factor $\tilde{L}$ and solve the system $My = r$ by performing two triangular
substitutions with $\tilde{L}$. Because the sparsity of $\tilde{L}$ is similar to that of $A$, the cost of solving
$My = r$ is similar to the cost of computing the matrix-vector product $Ap$.

There are several possible pitfalls in the incomplete Cholesky approach. One is that
the resulting matrix may not be (sufficiently) positive definite, and in this case one may need
to increase the values of the diagonal elements to ensure that a value for $\tilde{L}$ can be found.
Numerical instability or breakdown can occur during the incomplete factorization because
of the sparsity conditions we impose on the factor $\tilde{L}$. This difficulty can be remedied by
allowing additional fill-in in $\tilde{L}$, but the denser factor will be more expensive to compute and
to apply at each iteration.
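As an illustration of the idea, here is a minimal sketch of one common variant, in which the approximate factor is allowed nonzeros only where the lower triangle of $A$ has them (often called zero-fill incomplete Cholesky). It uses dense list-of-rows storage purely for readability; a practical code would use sparse storage. The function names, the breakdown check, and the choice of sparsity pattern are assumptions, not a prescription from the text.

```python
def incomplete_cholesky(A):
    """Zero-fill incomplete Cholesky sketch: returns lower-triangular L_tilde.

    Entries of L_tilde are computed only where A[i][j] is nonzero (i > j);
    all other below-diagonal entries are left at zero.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            # The pitfall mentioned above: a nonpositive pivot.
            raise ValueError("breakdown: nonpositive pivot in column %d" % j)
        L[j][j] = d ** 0.5
        for i in range(j + 1, n):
            if A[i][j] != 0.0:                      # keep A's sparsity pattern
                s = sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_with_factor(L, r):
    """Solve (L L^T) y = r by forward and then backward substitution."""
    n = len(r)
    z = [0.0] * n
    for i in range(n):                              # L z = r
        z[i] = (r[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    y = [0.0] * n
    for i in reversed(range(n)):                    # L^T y = z
        y[i] = (z[i] - sum(L[k][i] * y[k] for k in range(i + 1, n))) / L[i][i]
    return y

# Hypothetical matrix whose exact Cholesky factor would fill in position (3, 1).
A = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
L_tilde = incomplete_cholesky(A)
print(L_tilde[3][1])                                # stays zero: no fill-in
print(solve_with_factor(L_tilde, [1.0, 2.0, 3.0, 4.0]))
```

For a tridiagonal matrix the exact factor has no fill-in, so this sketch reproduces the exact Cholesky factor in that case; in general $\tilde{L}\tilde{L}^T$ only approximates $A$.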
5.2 NONLINEAR CONJUGATE GRADIENT METHODS


We have noted that the CG method, Algorithm 5.2, can be viewed as a minimization
algorithm for the convex quadratic function $\phi$ defined by (5.2). It is natural to ask whether
we can adapt the approach to minimize general convex functions, or even general nonlinear
functions $f$. In fact, as we show in this section, nonlinear variants of the conjugate gradient
method are well studied and have proved to be quite successful in practice.

THE  FLETCHER-REEVES  METHOD 

Fletcher and Reeves [107] showed how to extend the conjugate gradient method to
nonlinear functions by making two simple changes in Algorithm 5.2. First, in place of
the formula (5.24a) for the step length $\alpha_k$ (which minimizes $\phi$ along the search direction
$p_k$), we need to perform a line search that identifies an approximate minimum of the
nonlinear function $f$ along $p_k$. Second, the residual $r$, which is simply the gradient of $\phi$ in
Algorithm 5.2 (see (5.3)), must be replaced by the gradient of the nonlinear objective $f$.
These changes give rise to the following algorithm for nonlinear optimization.

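Algorithm 5.4 (FR).

  Given $x_0$;
  Evaluate $f_0 = f(x_0)$, $\nabla f_0 = \nabla f(x_0)$;
  Set $p_0 \leftarrow -\nabla f_0$, $k \leftarrow 0$;
  while $\nabla f_k \ne 0$
    Compute $\alpha_k$ and set $x_{k+1} = x_k + \alpha_k p_k$;
    Evaluate $\nabla f_{k+1}$;
    $\beta_{k+1}^{FR} \leftarrow \dfrac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$;   (5.41a)
    $p_{k+1} \leftarrow -\nabla f_{k+1} + \beta_{k+1}^{FR} p_k$;   (5.41b)
    $k \leftarrow k + 1$;   (5.41c)
  end (while)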

If we choose $f$ to be a strongly convex quadratic and $\alpha_k$ to be the exact minimizer, this
algorithm reduces to the linear conjugate gradient method, Algorithm 5.2. Algorithm 5.4
is appealing for large nonlinear optimization problems because each iteration requires only
evaluation of the objective function and its gradient. No matrix operations are required for
the step computation, and just a few vectors of storage are required.

To make the specification of Algorithm 5.4 complete, we need to be more precise
about the choice of line search parameter $\alpha_k$. Because of the second term in (5.41b), the
search direction $p_k$ may fail to be a descent direction unless $\alpha_k$ satisfies certain conditions.
By taking the inner product of (5.41b) (with $k$ replacing $k + 1$) with the gradient vector
$\nabla f_k$, we obtain

$$\nabla f_k^T p_k = -\|\nabla f_k\|^2 + \beta_k^{FR} \nabla f_k^T p_{k-1}. \tag{5.42}$$

If the line search is exact, so that $\alpha_{k-1}$ is a local minimizer of $f$ along the direction $p_{k-1}$,
we have that $\nabla f_k^T p_{k-1} = 0$. In this case we have from (5.42) that $\nabla f_k^T p_k < 0$, so that $p_k$ is
indeed a descent direction. If the line search is not exact, however, the second term in (5.42)
may dominate the first term, and we may have $\nabla f_k^T p_k > 0$, implying that $p_k$ is actually a
direction of ascent. Fortunately, we can avoid this situation by requiring the step length $\alpha_k$
to satisfy the strong Wolfe conditions, which we restate here:

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \tag{5.43a}$$

$$|\nabla f(x_k + \alpha_k p_k)^T p_k| \le -c_2 \nabla f_k^T p_k, \tag{5.43b}$$

where $0 < c_1 < c_2 < \tfrac{1}{2}$. (Note that we impose $c_2 < \tfrac{1}{2}$ here, in place of the looser condition
$c_2 < 1$ that was used in the earlier statement (3.7).) By applying Lemma 5.6 below, we can
show that condition (5.43b) implies that (5.42) is negative, and we conclude that any line
search procedure that yields an $\alpha_k$ satisfying (5.43) will ensure that all directions $p_k$ are
descent directions for the function $f$.
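The strong Wolfe conditions (5.43) are straightforward to test for a candidate step length. The following sketch checks both inequalities for a given trial $\alpha$. The function name `satisfies_strong_wolfe` and the default constants `c1 = 1e-4` and `c2 = 0.1` are assumptions chosen to respect $0 < c_1 < c_2 < \tfrac{1}{2}$; this is only a checker, not a line search procedure.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.1):
    """Return True if the step length alpha satisfies (5.43a) and (5.43b).

    f maps a list to a float, grad maps a list to a list; p is assumed to
    be a descent direction at x, so that grad(x)^T p < 0.
    """
    slope0 = dot(grad(x), p)                       # grad f_k^T p_k
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0   # (5.43a)
    curvature = abs(dot(grad(x_new), p)) <= -c2 * slope0           # (5.43b)
    return sufficient_decrease and curvature

# Hypothetical one-dimensional example: f(x) = x^2, steepest descent direction.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
x, p = [1.0], [-2.0]
for alpha in (0.01, 0.5, 1.0):
    print(alpha, satisfies_strong_wolfe(f, grad, x, p, alpha))
```

In this example the very short step fails the curvature condition (5.43b), the exact minimizing step passes both conditions, and the overly long step fails the sufficient decrease condition (5.43a).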


THE  POLAK-RIBIERE  METHOD  AND  VARIANTS 

There are many variants of the Fletcher-Reeves method that differ from each other
mainly in the choice of the parameter $\beta_k$. An important variant, proposed by Polak and
Ribiere, defines this parameter as follows:

$$\beta_{k+1}^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\|\nabla f_k\|^2}. \tag{5.44}$$

We refer to the algorithm in which (5.44) replaces (5.41a) as Algorithm PR. It is identical to
Algorithm FR when $f$ is a strongly convex quadratic function and the line search is exact,
since by (5.16) the gradients are mutually orthogonal, and so $\beta_{k+1}^{PR} = \beta_{k+1}^{FR}$. When applied
to general nonlinear functions with inexact line searches, however, the behavior of the two
algorithms differs markedly. Numerical experience indicates that Algorithm PR tends to be
the more robust and efficient of the two.

A surprising fact about Algorithm PR is that the strong Wolfe conditions (5.43) do
not guarantee that $p_k$ is always a descent direction. If we define the $\beta$ parameter as

$$\beta_{k+1}^{+} = \max\{\beta_{k+1}^{PR}, 0\}, \tag{5.45}$$
giving rise to an algorithm we call Algorithm PR+, then a simple adaptation of the strong
Wolfe conditions ensures that the descent property holds.

There are many other choices for $\beta_{k+1}$ that coincide with the Fletcher-Reeves formula
$\beta_{k+1}^{FR}$ in the case where the objective is quadratic and the line search is exact. The
Hestenes-Stiefel formula, which defines

$$\beta_{k+1}^{HS} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{(\nabla f_{k+1} - \nabla f_k)^T p_k}, \tag{5.46}$$

gives rise to an algorithm (called Algorithm HS) that is similar to Algorithm PR, both in
terms of its theoretical convergence properties and in its practical performance. Formula
(5.46) can be derived by demanding that consecutive search directions be conjugate with
respect to the average Hessian over the line segment $[x_k, x_{k+1}]$, which is defined as

$$\bar{G}_k = \int_0^1 [\nabla^2 f(x_k + \tau \alpha_k p_k)] \, d\tau.$$

Recalling from Taylor’s theorem (Theorem 2.1) that $\nabla f_{k+1} = \nabla f_k + \alpha_k \bar{G}_k p_k$, we see that
for any direction of the form $p_{k+1} = -\nabla f_{k+1} + \beta_{k+1} p_k$, the condition $p_{k+1}^T \bar{G}_k p_k = 0$
requires $\beta_{k+1}$ to be given by (5.46).

Later, we see that it is possible to guarantee global convergence for any parameter $\beta_k$
satisfying the bound

$$|\beta_k| \le \beta_k^{FR}, \tag{5.47}$$

for all $k \ge 2$. This fact suggests the following modification of the PR method, which has
performed well on some applications. For all $k \ge 2$ let

$$\beta_k = \begin{cases}
-\beta_k^{FR} & \text{if } \beta_k^{PR} < -\beta_k^{FR}, \\
\beta_k^{PR} & \text{if } |\beta_k^{PR}| \le \beta_k^{FR}, \\
\beta_k^{FR} & \text{if } \beta_k^{PR} > \beta_k^{FR}.
\end{cases} \tag{5.48}$$

The algorithm based on this strategy will be denoted by FR-PR.
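The formulas (5.41a), (5.44), (5.45), (5.46), and (5.48) differ only in how they combine the two most recent gradients and the previous direction, which the following sketch makes explicit. Gradients and directions are plain Python lists, and all function names are hypothetical. Here `g_new` and `g_old` stand for the gradients at the new and previous iterates, and `p_old` for the previous search direction.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def beta_fr(g_new, g_old):
    """Fletcher-Reeves parameter, (5.41a)."""
    return dot(g_new, g_new) / dot(g_old, g_old)

def beta_pr(g_new, g_old):
    """Polak-Ribiere parameter, (5.44)."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    return dot(g_new, diff) / dot(g_old, g_old)

def beta_pr_plus(g_new, g_old):
    """PR+ parameter, (5.45): the PR value truncated at zero."""
    return max(beta_pr(g_new, g_old), 0.0)

def beta_hs(g_new, g_old, p_old):
    """Hestenes-Stiefel parameter, (5.46)."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    return dot(g_new, diff) / dot(diff, p_old)

def beta_fr_pr(g_new, g_old):
    """FR-PR parameter, (5.48): the PR value clipped to [-FR, FR]."""
    fr = beta_fr(g_new, g_old)
    pr = beta_pr(g_new, g_old)
    if pr < -fr:
        return -fr
    if pr > fr:
        return fr
    return pr

# Orthogonal successive gradients, with the new gradient orthogonal to p_old,
# as for a quadratic objective with exact line searches.
g_old, g_new, p_old = [1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]
print(beta_fr(g_new, g_old), beta_pr(g_new, g_old), beta_hs(g_new, g_old, p_old))
```

When successive gradients are orthogonal, as in the final example, the FR, PR, and HS values coincide, in line with the remark that these formulas agree for a quadratic objective with exact line searches.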

Other variants of the CG method have recently been proposed. Two choices for $\beta_{k+1}$
that possess attractive theoretical and computational properties are

$$
\beta_{k+1} = \frac{\|\nabla f_{k+1}\|^2}{(\nabla f_{k+1} - \nabla f_k)^T p_k} \tag{5.49}
$$

(see [85]) and

$$
\beta_{k+1} = \left( \hat{y}_k - 2 p_k \frac{\|\hat{y}_k\|^2}{\hat{y}_k^T p_k} \right)^T \frac{\nabla f_{k+1}}{\hat{y}_k^T p_k}, \qquad \text{with } \hat{y}_k = \nabla f_{k+1} - \nabla f_k \tag{5.50}
$$

(see [161]). These two choices guarantee that $p_k$ is a descent direction, provided the
step length $\alpha_k$ satisfies the Wolfe conditions. The CG algorithms based on (5.49) or (5.50)
appear to be competitive with the Polak-Ribiere method.
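To make the differences among these formulas concrete, the following minimal sketch evaluates each choice of $\beta_{k+1}$ from the same gradients and search direction. It is an illustration only: the function name `cg_betas`, the dictionary labels, and the plain-list vector representation are assumptions made here, and the Fletcher-Reeves and Polak-Ribiere expressions are taken to be the standard definitions referred to as (5.41a) and (5.44).

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def cg_betas(g_new, g_old, p):
    """Evaluate several choices of beta_{k+1} from the same data.

    g_new, g_old: gradients at x_{k+1} and x_k; p: search direction p_k.
    """
    y = [a - b for a, b in zip(g_new, g_old)]  # y_k = grad f_{k+1} - grad f_k
    gg_new, gg_old = dot(g_new, g_new), dot(g_old, g_old)
    yp = dot(y, p)  # denominator shared by (5.46), (5.49) and (5.50)

    fr = gg_new / gg_old          # Fletcher-Reeves, (5.41a)
    pr = dot(g_new, y) / gg_old   # Polak-Ribiere, (5.44)
    hs = dot(g_new, y) / yp       # Hestenes-Stiefel, (5.46)
    dy = gg_new / yp              # formula (5.49), see [85]
    scale = 2.0 * dot(y, y) / yp
    hz = dot([yi - scale * pi for yi, pi in zip(y, p)], g_new) / yp  # (5.50), see [161]
    return {
        "FR": fr,
        "PR": pr,
        "PR+": max(pr, 0.0),             # (5.45): reset negative values to zero
        "HS": hs,
        "FR-PR": max(-fr, min(pr, fr)),  # (5.48): clamp PR to [-FR, FR]
        "DY": dy,
        "HZ": hz,
    }


# One exact line search step on f(x) = 0.5*(x1^2 + 2*x2^2), starting from (1, 1).
g0 = [1.0, 2.0]
p0 = [-1.0, -2.0]
g1 = [4.0 / 9.0, -2.0 / 9.0]
for name, value in cg_betas(g1, g0, p0).items():
    print(f"{name:6s} {value:.6f}")
```

On this quadratic data with an exact step, every variant returns the same value, 4/81, up to rounding. The formulas differ only once the objective is nonquadratic or the line search is inexact.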

QUADRATIC  TERMINATION  AND  RESTARTS 

Implementations of nonlinear conjugate gradient methods usually preserve their
close connections with the linear conjugate gradient method. Usually, a quadratic (or cubic)
interpolation along the search direction $p_k$ is incorporated into the line search procedure; see
Chapter 3. This feature guarantees that when $f$ is a strictly convex quadratic, the step length
$\alpha_k$ is chosen to be the exact one-dimensional minimizer, so that the nonlinear conjugate
gradient method reduces to the linear method, Algorithm 5.2.

Another modification that is often used in nonlinear conjugate gradient procedures
is to restart the iteration at every $n$ steps by setting $\beta_k = 0$ in (5.41a), that is, by taking
a steepest descent step. Restarting serves to periodically refresh the algorithm, erasing old
information that may not be beneficial. We can even prove a strong theoretical result about
restarting: It leads to $n$-step quadratic convergence, that is,

$$
\|x_{k+n} - x^*\| = O\left( \|x_k - x^*\|^2 \right). \tag{5.51}
$$

After a little thought, this result is not so surprising. Consider a function $f$ that is strongly
convex  quadratic  in  a  neighborhood  of  the  solution,  but  is  nonquadratic  everywhere  else. 
Assuming  that  the  algorithm  is  converging  to  the  solution  in  question,  the  iterates  will 
eventually  enter  the  quadratic  region.  At  some  point,  the  algorithm  will  be  restarted  in  that 
region,  and  from  that  point  onward,  its  behavior  will  simply  be  that  of  the  linear  conjugate 
gradient  method,  Algorithm  5.2.  In  particular,  finite  termination  will  occur  within  n  steps 
of  the  restart.  The  restart  is  important,  because  the  finite-termination  property  and  other 
appealing  properties  of  Algorithm  5.2  hold  only  when  its  initial  search  direction  p0  is  equal 
to  the  negative  gradient. 
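The finite-termination argument above can be checked numerically. The sketch below runs the Fletcher-Reeves iteration with exact line searches on a small strictly convex quadratic, starting along the negative gradient as a restart would. The function name `fr_exact_on_quadratic` and the 3 x 3 test data are assumptions chosen for illustration, not part of any algorithm listing in the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def fr_exact_on_quadratic(A, b, x0, steps):
    """Fletcher-Reeves CG with exact line searches on f(x) = 0.5 x^T A x - b^T x.

    Returns the final iterate and the gradient norm at the start and after each step.
    """
    x = list(x0)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient A x - b
    p = [-gi for gi in g]                              # first direction: -gradient
    norms = [dot(g, g) ** 0.5]
    for _ in range(steps):
        if dot(g, g) == 0.0:                           # already at the minimizer
            break
        Ap = matvec(A, p)
        alpha = -dot(g, p) / dot(p, Ap)                # exact one-dimensional minimizer
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = [gi + alpha * api for gi, api in zip(g, Ap)]
        beta = dot(g_new, g_new) / dot(g, g)           # Fletcher-Reeves beta
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]
        g = g_new
        norms.append(dot(g, g) ** 0.5)
    return x, norms


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # symmetric positive definite
b = [1.0, 2.0, 3.0]
x, norms = fr_exact_on_quadratic(A, b, [0.0, 0.0, 0.0], steps=3)
print(norms)
```

With $n = 3$, the gradient norm drops to rounding-error level after the third step, which is the behavior the restart is meant to recover once the iterates reach a region where the function is quadratic.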

Even if the function $f$ is not exactly quadratic in the region of a solution, Taylor's
theorem (Theorem 2.1) implies that it can still be approximated quite closely by a quadratic,
provided  that  it  is  smooth.  Therefore,  while  we  would  not  expect  termination  in  n  steps 
after  the  restart,  it  is  not  surprising  that  substantial  progress  is  made  toward  the  solution,  as 
indicated  by  the  expression  (5.51). 

Though the result (5.51) is interesting from a theoretical viewpoint, it may not be
relevant in a practical context, because nonlinear conjugate gradient methods can be
recommended only for solving problems with large $n$. Restarts may never occur in such problems
because an approximate solution may be located in fewer than $n$ steps. Hence, nonlinear
CG methods are sometimes implemented without restarts, or else they include strategies for
restarting that are based on considerations other than iteration counts. The most popular
restart strategy makes use of the observation (5.16), which is that the gradients are mutually
orthogonal when $f$ is a quadratic function. A restart is performed whenever two consecutive
gradients are far from orthogonal, as measured by the test

$$
\frac{|\nabla f_k^T \nabla f_{k-1}|}{\|\nabla f_k\|^2} \ge \nu, \tag{5.52}
$$

where a typical value for the parameter $\nu$ is 0.1.
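The restart test is a one-line check in code. The following sketch assumes gradients stored as plain lists; the name `should_restart` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def should_restart(g_new, g_old, nu=0.1):
    """Restart test (5.52): |g_k^T g_{k-1}| / ||g_k||^2 >= nu.

    Written without the division so that a zero gradient does not raise.
    """
    return abs(dot(g_new, g_old)) >= nu * dot(g_new, g_new)


print(should_restart([1.0, 0.0], [0.0, 2.0]))  # orthogonal gradients: False
print(should_restart([1.0, 0.2], [1.0, 0.0]))  # nearly parallel gradients: True
```

When the test returns true, the next direction would be taken as the negative gradient (that is, $\beta$ set to zero) instead of the usual update.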

We could also think of formula (5.45) as a restarting strategy, because $p_{k+1}$ will revert
to the steepest descent direction whenever $\beta_{k+1}^{PR}$ is negative. In contrast to (5.52), these restarts
are rather infrequent because $\beta_{k+1}^{PR}$ is positive most of the time.


BEHAVIOR  OF  THE  FLETCHER-REEVES  METHOD 

We  now  investigate  the  Fletcher-Reeves  algorithm,  Algorithm  5.4,  a  little  more  closely, 
proving  that  it  is  globally  convergent  and  explaining  some  of  its  observed  inefficiencies. 

The following result gives conditions on the line search under which all search
directions are descent directions. It assumes that the level set $\mathcal{L} = \{x : f(x) \le f(x_0)\}$ is bounded
and that $f$ is twice continuously differentiable, so that we have from Lemma 3.1 that there
exists a step length $\alpha_k$ satisfying the strong Wolfe conditions.

Lemma  5.6. 

Suppose that Algorithm 5.4 is implemented with a step length $\alpha_k$ that satisfies the strong
Wolfe conditions (5.43) with $0 < c_2 < \frac{1}{2}$. Then the method generates descent directions $p_k$
that satisfy the following inequalities:

$$
-\frac{1}{1 - c_2} \le \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2} \le \frac{2 c_2 - 1}{1 - c_2}, \qquad \text{for all } k = 0, 1, \ldots. \tag{5.53}
$$


Proof. Note first that the function $t(\xi) = (2\xi - 1)/(1 - \xi)$ is monotonically increasing
on the interval $[0, \frac{1}{2}]$ and that $t(0) = -1$ and $t(\frac{1}{2}) = 0$. Hence, because of $c_2 \in (0, \frac{1}{2})$, we
have

$$
-1 < \frac{2 c_2 - 1}{1 - c_2} < 0. \tag{5.54}
$$

The descent condition $\nabla f_k^T p_k < 0$ follows immediately once we establish (5.53).

The proof is by induction. For $k = 0$, the middle term in (5.53) is $-1$, so by using
(5.54), we see that both inequalities in (5.53) are satisfied. Next, assume that (5.53) holds
for some $k \ge 1$. From (5.41b) and (5.41a) we have

$$
\frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} = -1 + \beta_{k+1} \frac{\nabla f_{k+1}^T p_k}{\|\nabla f_{k+1}\|^2} = -1 + \frac{\nabla f_{k+1}^T p_k}{\|\nabla f_k\|^2}. \tag{5.55}
$$



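By using the line search condition (5.43b), we have

$$
|\nabla f_{k+1}^T p_k| \le -c_2 \nabla f_k^T p_k,
$$

so by combining with (5.55) and recalling (5.41a), we obtain

$$
-1 + c_2 \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2} \le \frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} \le -1 - c_2 \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2}.
$$

Substituting for the term $\nabla f_k^T p_k / \|\nabla f_k\|^2$ from the left-hand side of the induction
hypothesis (5.53), we obtain

$$
-1 - \frac{c_2}{1 - c_2} \le \frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} \le -1 + \frac{c_2}{1 - c_2},
$$

which shows that (5.53) holds for $k + 1$ as well. □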

This result used only the second strong Wolfe condition (5.43b); the first Wolfe
condition (5.43a) will be needed in the next section to establish global convergence. The
bounds on $\nabla f_k^T p_k$ in (5.53) impose a limit on how fast the norms of the steps $\|p_k\|$ can
grow, and they will play a crucial role in the convergence analysis given below.
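To get a feel for how tight the bounds in (5.53) are, the short sketch below evaluates them for a few values of $c_2$. The helper name `fr_descent_bounds` is an assumption introduced for this illustration.

```python
def fr_descent_bounds(c2):
    """Return the lower and upper bounds in (5.53) for a given c2 in (0, 1/2)."""
    if not 0.0 < c2 < 0.5:
        raise ValueError("Lemma 5.6 requires 0 < c2 < 1/2")
    return -1.0 / (1.0 - c2), (2.0 * c2 - 1.0) / (1.0 - c2)


for c2 in (0.1, 0.3, 0.49):
    lower, upper = fr_descent_bounds(c2)
    print(f"c2={c2}: {lower:.3f} <= grad^T p / ||grad||^2 <= {upper:.3f}")
```

As $c_2$ approaches $\frac{1}{2}$ the upper bound approaches zero, so the guaranteed amount of descent shrinks; a small value such as $c_2 = 0.1$ keeps the ratio close to its steepest descent value of $-1$.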

Lemma  5.6  can  also  be  used  to  explain  a  weakness  of  the  Fletcher-Reeves  method. 
We  will  argue  that  if  the  method  generates  a  bad  direction  and  a  tiny  step,  then  the  next 
direction and next step are also likely to be poor. As in Chapter 3, we let $\theta_k$ denote the angle
between $p_k$ and the steepest descent direction $-\nabla f_k$, defined by

$$
\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\| \, \|p_k\|}. \tag{5.56}
$$

Suppose that $p_k$ is a poor search direction, in the sense that it makes an angle of nearly 90°
with $-\nabla f_k$, that is, $\cos\theta_k \approx 0$. By multiplying both sides of (5.53) by $\|\nabla f_k\| / \|p_k\|$ and
using (5.56), we obtain

$$
\frac{1 - 2 c_2}{1 - c_2} \frac{\|\nabla f_k\|}{\|p_k\|} \le \cos\theta_k \le \frac{1}{1 - c_2} \frac{\|\nabla f_k\|}{\|p_k\|}, \qquad \text{for all } k = 0, 1, \ldots. \tag{5.57}
$$

From these inequalities, we deduce that $\cos\theta_k \approx 0$ if and only if

$$
\|\nabla f_k\| \ll \|p_k\|.
$$

Since $p_k$ is almost orthogonal to the gradient, it is likely that the step from $x_k$ to $x_{k+1}$ is tiny,
that is, $x_{k+1} \approx x_k$. If so, we have $\nabla f_{k+1} \approx \nabla f_k$, and therefore

$$
\beta_{k+1}^{FR} \approx 1, \tag{5.58}
$$

by the definition (5.41a). By using this approximation together with $\|\nabla f_{k+1}\| \approx \|\nabla f_k\| \ll \|p_k\|$
in (5.41b), we conclude that

$$
p_{k+1} \approx p_k,
$$

so the new search direction will improve little (if at all) on the previous one. It follows that
if the condition $\cos\theta_k \approx 0$ holds at some iteration $k$ and if the subsequent step is small, a
long sequence of unproductive iterates will follow.

The Polak-Ribiere method behaves quite differently in these circumstances. If, as in
the previous paragraph, the search direction $p_k$ satisfies $\cos\theta_k \approx 0$ for some $k$, and if the
subsequent step is small, it follows by substituting $\nabla f_k \approx \nabla f_{k+1}$ into (5.44) that $\beta_{k+1}^{PR} \approx 0$.
From the formula (5.41b), we find that the new search direction $p_{k+1}$ will be close to the
steepest descent direction $-\nabla f_{k+1}$, and $\cos\theta_{k+1}$ will be close to 1. Therefore, Algorithm PR
essentially performs a restart after it encounters a bad direction. The same argument can
be applied to Algorithms PR+ and HS. For the FR-PR variant, defined by (5.48), we have
noted already that $\beta_{k+1}^{FR} \approx 1$ and $\beta_{k+1}^{PR} \approx 0$. The formula (5.48) thus sets $\beta_{k+1} = \beta_{k+1}^{PR}$, as
desired. Thus, the modification (5.48) seems to avoid the inefficiencies of the FR method,
while falling back on this method for global convergence.
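The contrast between the two methods after a poor step can be reproduced with a few numbers. In the sketch below the vectors are made-up data chosen so that $p_k$ is almost orthogonal to the gradient and the gradient barely changes; the helper names `cos_theta` and `next_direction` are assumptions for this illustration, and the $\beta$ formulas are the standard definitions referred to as (5.41a) and (5.44).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cos_theta(g, p):
    """Cosine of the angle between p and the steepest descent direction -g, as in (5.56)."""
    return -dot(g, p) / (dot(g, g) ** 0.5 * dot(p, p) ** 0.5)


def next_direction(g_new, g_old, p, rule):
    """Form p_{k+1} = -g_{k+1} + beta * p_k with the FR or PR choice of beta."""
    if rule == "FR":
        beta = dot(g_new, g_new) / dot(g_old, g_old)
    elif rule == "PR":
        y = [a - b for a, b in zip(g_new, g_old)]
        beta = dot(g_new, y) / dot(g_old, g_old)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return [-gn + beta * pi for gn, pi in zip(g_new, p)]


# Made-up data: a nearly orthogonal direction followed by a tiny change in the gradient.
g_old = [1.0, 0.0]
p = [-0.01, 100.0]
g_new = [1.0, 1e-4]

print("cos(theta_k):", cos_theta(g_old, p))
for rule in ("FR", "PR"):
    p_next = next_direction(g_new, g_old, p, rule)
    print(f"cos(theta_k+1) with {rule}:", cos_theta(g_new, p_next))
```

With these numbers the Fletcher-Reeves direction remains nearly orthogonal to the gradient, while the Polak-Ribiere direction is almost exactly the steepest descent direction.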

The  undesirable  behavior  of  the  Fletcher-Reeves  method  predicted  by  the  arguments 
given  above  can  be  observed  in  practice.  For  example,  the  paper  [123]  describes  a  problem 
with $n = 100$ in which $\cos\theta_k$ is of order $10^{-2}$ for hundreds of iterations and the steps
$\|x_k - x_{k-1}\|$ are of order $10^{-2}$. Algorithm FR requires thousands of iterations to solve this
problem,  while  Algorithm  PR  requires  just  37  iterations.  In  this  example,  the  Fletcher- 
Reeves  method  performs  much  better  if  it  is  periodically  restarted  along  the  steepest  descent 
direction,  since  each  restart  terminates  the  cycle  of  bad  steps.  In  general,  Algorithm  FR 
should  not  be  implemented  without  some  kind  of  restart  strategy. 

GLOBAL  CONVERGENCE 

Unlike  the  linear  conjugate  gradient  method,  whose  convergence  properties  are  well 
understood  and  which  is  known  to  be  optimal  as  described  above,  nonlinear  conjugate 
gradient  methods  possess  surprising,  sometimes  bizarre,  convergence  properties.  We  now 
present  a  few  of  the  main  results  known  for  the  Fletcher-Reeves  and  Polak-Ribiere  methods 
using  practical  line  searches. 

For  the  purposes  of  this  section,  we  make  the  following  (nonrestrictive)  assumptions 
on  the  objective  function. 

Assumptions  5.1. 

(i) The level set $\mathcal{L} := \{x \mid f(x) \le f(x_0)\}$ is bounded;

(ii) In some open neighborhood $\mathcal{N}$ of $\mathcal{L}$, the objective function $f$ is Lipschitz continuously
differentiable.

These assumptions imply that there is a constant $\bar{\gamma}$ such that

$$
\|\nabla f(x)\| \le \bar{\gamma}, \qquad \text{for all } x \in \mathcal{L}. \tag{5.59}
$$

Our main analytical tool in this section is Zoutendijk's theorem (Theorem 3.2 in
Chapter 3). It states that, under Assumptions 5.1, any line search iteration of the form
$x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions
(5.43), gives the limit

$$
\sum_{k=0}^{\infty} \cos^2\theta_k \, \|\nabla f_k\|^2 < \infty. \tag{5.60}
$$


We can use this result to prove global convergence for algorithms that are periodically
restarted by setting $\beta_k = 0$. If $k_1$, $k_2$, and so on denote the iterations on which restarts occur,
then $\cos\theta_k = 1$ at each of these iterations (the search direction is the steepest descent
direction), so we have from (5.60) that

$$
\sum_{k = k_1, k_2, \ldots} \|\nabla f_k\|^2 < \infty. \tag{5.61}
$$

If we allow no more than $n$ iterations between restarts, the sequence $\{k_j\}_{j=1}^{\infty}$ is infinite,
and from (5.61) we have that $\lim_{j \to \infty} \|\nabla f_{k_j}\| = 0$. That is, a subsequence of gradients
approaches zero, or equivalently,

$$
\liminf_{k \to \infty} \|\nabla f_k\| = 0. \tag{5.62}
$$

This  result  applies  equally  to  restarted  versions  of  all  the  algorithms  discussed  in  this  chapter. 

It  is  more  interesting,  however,  to  study  the  global  convergence  of  unrestarted  conjugate 
gradient methods, because for large problems (say $n \ge 1000$) we expect to find a solution in
many  fewer  than  n  iterations — the  first  point  at  which  a  regular  restart  would  take  place.  Our 
study  of  large  sequences  of  unrestarted  conjugate  gradient  iterations  reveals  some  surprising 
patterns  in  their  behavior. 

We can build on Lemma 5.6 and Zoutendijk's result (5.60) to prove a global convergence
result for the Fletcher-Reeves method. While we cannot show that the limit of the
sequence of gradients $\{\nabla f_k\}$ is zero, the following result shows that this sequence is not
bounded away from zero.

Theorem  5.7  (Al-Baali  [3]). 

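Suppose that Assumptions 5.1 hold, and that Algorithm 5.4 is implemented with a line
search that satisfies the strong Wolfe conditions (5.43), with $0 < c_1 < c_2 < \frac{1}{2}$. Then

$$
\liminf_{k \to \infty} \|\nabla f_k\| = 0. \tag{5.63}
$$

Proof. The proof is by contradiction. It assumes that the opposite of (5.63) holds, that is,
there is a constant $\gamma > 0$ such that

$$
\|\nabla f_k\| \ge \gamma, \tag{5.64}
$$

for all $k$ sufficiently large. By substituting the left inequality of (5.57) into Zoutendijk's
condition (5.60), we obtain

$$
\sum_{k=0}^{\infty} \frac{\|\nabla f_k\|^4}{\|p_k\|^2} < \infty. \tag{5.65}
$$

By using (5.43b) and (5.53), we obtain that

$$
|\nabla f_k^T p_{k-1}| \le -c_2 \nabla f_{k-1}^T p_{k-1} \le \frac{c_2}{1 - c_2} \|\nabla f_{k-1}\|^2. \tag{5.66}
$$

Thus, from (5.41b) and recalling the definition (5.41a) of $\beta_k^{FR}$ we obtain

$$
\begin{aligned}
\|p_k\|^2 &\le \|\nabla f_k\|^2 + 2 \beta_k^{FR} |\nabla f_k^T p_{k-1}| + (\beta_k^{FR})^2 \|p_{k-1}\|^2 \\
&\le \|\nabla f_k\|^2 + \frac{2 c_2}{1 - c_2} \beta_k^{FR} \|\nabla f_{k-1}\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2 \\
&= \left( \frac{1 + c_2}{1 - c_2} \right) \|\nabla f_k\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2.
\end{aligned}
$$

Applying this relation repeatedly, and defining $c_3 = (1 + c_2)/(1 - c_2) \ge 1$, we have

$$
\begin{aligned}
\|p_k\|^2 &\le c_3 \|\nabla f_k\|^2 + (\beta_k^{FR})^2 \Bigl( c_3 \|\nabla f_{k-1}\|^2 + (\beta_{k-1}^{FR})^2 \bigl( c_3 \|\nabla f_{k-2}\|^2 + \cdots + (\beta_1^{FR})^2 \|p_0\|^2 \bigr) \cdots \Bigr) \\
&\le c_3 \|\nabla f_k\|^4 \sum_{j=0}^{k} \|\nabla f_j\|^{-2},
\end{aligned} \tag{5.67}
$$

where we used the facts that

$$
(\beta_k^{FR})^2 (\beta_{k-1}^{FR})^2 \cdots (\beta_{k-i}^{FR})^2 = \frac{\|\nabla f_k\|^4}{\|\nabla f_{k-i-1}\|^4}
$$

and $p_0 = -\nabla f_0$ (together with $c_3 \ge 1$, which bounds the final term). By using the bounds
(5.59) and (5.64) in (5.67), we obtain

$$
\|p_k\|^2 \le \frac{c_3 \bar{\gamma}^4}{\gamma^2} \, k, \tag{5.68}
$$

which implies that

$$
\sum_{k=1}^{\infty} \frac{1}{\|p_k\|^2} \ge \gamma_4 \sum_{k=1}^{\infty} \frac{1}{k}, \tag{5.69}
$$

for some positive constant $\gamma_4$.

On the other hand, from (5.64) and (5.65), we have that

$$
\sum_{k=1}^{\infty} \frac{1}{\|p_k\|^2} < \infty. \tag{5.70}
$$

However, if we combine this inequality with (5.69), we obtain that $\sum_{k=1}^{\infty} 1/k < \infty$, which
is not true. Hence, (5.64) does not hold, and the claim (5.63) is proved. □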


This global convergence result can be extended to any choice of $\beta_k$ satisfying (5.47),
and in particular to the FR-PR method given by (5.48).

In general, if we can show that there exist constants $c_4, c_5 > 0$ such that

$$
\cos\theta_k \ge c_4 \frac{\|\nabla f_k\|}{\|p_k\|}, \qquad \frac{\|\nabla f_k\|}{\|p_k\|} \ge c_5 > 0, \qquad k = 1, 2, \ldots,
$$

it follows from (5.60) that

$$
\lim_{k \to \infty} \|\nabla f_k\| = 0.
$$

In fact, this result can be established for the Polak-Ribiere method under the assumption
that $f$ is strongly convex and that an exact line search is used.

For general (nonconvex) functions, however, it is not possible to prove a result like
Theorem 5.7 for Algorithm PR. This fact is unexpected, since the Polak-Ribiere method
performs better in practice than the Fletcher-Reeves method. The following surprising result
shows that the Polak-Ribiere method can cycle infinitely without approaching a solution
point, even if an ideal line search is used. (By “ideal” we mean that the line search returns a
value $\alpha_k$ that is the first positive stationary point for the function $t(\alpha) = f(x_k + \alpha p_k)$.)

Theorem  5.8. 

Consider the Polak-Ribiere method (5.44) with an ideal line search. There exists
a twice continuously differentiable objective function $f : \mathbb{R}^3 \to \mathbb{R}$ and a starting point $x_0 \in \mathbb{R}^3$
such that the sequence of gradients $\{\|\nabla f_k\|\}$ is bounded away from zero.

The proof of this result, given in [253], is quite complex. It demonstrates the existence
of the desired objective function without actually constructing this function explicitly. The
result is interesting, since the step length assumed in the proof (the first stationary point)
may be accepted by any of the practical line search algorithms currently in use. The proof
of Theorem 5.8 requires that some consecutive search directions become almost negatives
of each other. In the case of ideal line searches, this happens only if $\beta_k < 0$, so the analysis
suggests Algorithm PR+ (see (5.45)), in which we reset $\beta_k$ to zero whenever it becomes
negative. We mentioned earlier that a line search strategy based on a slight modification of
the Wolfe conditions guarantees that all search directions generated by Algorithm PR+ are
descent directions. Using these facts, it is possible to prove a global convergence result like
Theorem 5.7 for Algorithm PR+. An attractive property of the formulae (5.49), (5.50) is
that global convergence can be established without introducing any modification to a line
search based on the Wolfe conditions.

NUMERICAL  PERFORMANCE 

Table 5.1 illustrates the performance of Algorithms FR, PR, and PR+ without restarts.
For these tests, the parameters in the strong Wolfe conditions (5.43) were chosen to be
$c_1 = 10^{-4}$ and $c_2 = 0.1$. The iterations were terminated when

$$
\|\nabla f_k\|_\infty < 10^{-5} (1 + |f_k|).
$$

If this condition was not satisfied after 10,000 iterations, we declared failure (indicated by a
* in the table).
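The stopping rule used for these runs scales the gradient tolerance by the size of the objective value. A minimal sketch, with the hypothetical names `converged` and `MAX_ITER`, is:

```python
def converged(grad, f_val, tol=1e-5):
    """Stopping test ||grad||_inf < tol * (1 + |f|), with tol = 1e-5 as in the text."""
    return max(abs(gi) for gi in grad) < tol * (1.0 + abs(f_val))


MAX_ITER = 10_000  # iteration cap after which a run is reported as a failure

print(converged([1e-6, -2e-6], 0.0))    # True: gradient below 1e-5
print(converged([1e-3, 0.0], 0.0))      # False: gradient too large for f = 0
print(converged([1e-3, 0.0], 1000.0))   # True: tolerance scaled up by 1 + |f|
```

The third call shows the effect of the relative factor: the same gradient that fails the test at $f = 0$ passes when $|f|$ is large.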

The final column, headed “mod,” indicates the number of iterations of Algorithm PR+
for which the adjustment (5.45) was needed to ensure that $\beta_k^{PR} \ge 0$. Algorithm FR on problem
GENROS  takes  very  short  steps  far  from  the  solution  that  lead  to  tiny  improvements  in  the 
objective  function,  and  convergence  was  not  achieved  within  the  maximum  number  of 
iterations. 

The Polak-Ribiere algorithm, or its variation PR+, is not always more efficient
than Algorithm FR, and it has the slight disadvantage of requiring one more vector of
storage. Nevertheless, we recommend that users choose Algorithm PR, PR+, or FR-PR, or
the methods based on (5.49) and (5.50).


Table 5.1  Iterations and function/gradient evaluations required by three
nonlinear conjugate gradient methods on a set of test problems; see [123]

Problem      n      Alg FR       Alg PR       Alg PR+      mod
                    it/f-g       it/f-g       it/f-g
CALCVAR3     200    2808/5617    2631/5263    2631/5263    0
GENROS       500    *            1068/2151    1067/2149    1
XPOWSING     1000   533/1102     212/473      97/229       3
TRIDIA1      1000   264/531      262/527      262/527      0
MSQRT1       1000   422/849      113/231      113/231      0
XPOWELL      1000   568/1175     212/473      97/229       3
TRIGON       1000   231/467      40/92        40/92        0

NOTES AND REFERENCES

The conjugate gradient method was developed in the 1950s by Hestenes and
Stiefel [168] as an alternative to factorization methods for finding solutions of symmetric
positive definite systems. It was not until some years later, in one of the most important
developments in sparse linear algebra, that this method came to be viewed as an iterative
method that could give good approximate solutions to systems in many fewer than $n$ steps.
Our presentation of the linear conjugate gradient method follows that of Luenberger [195].
For a history of the development of the conjugate gradient and Lanczos methods see Golub
and O’Leary [135].

Interestingly  enough,  the  nonlinear  conjugate  gradient  method  of  Fletcher  and 
Reeves  [107]  was  proposed  after  the  linear  conjugate  gradient  method  had  fallen  out  of 
favor,  but  several  years  before  it  was  rediscovered  as  an  iterative  method  for  linear  systems. 
The  Polak-Ribiere  method  was  introduced  in  [237],  and  the  example  showing  that  it  may 
fail  to  converge  on  nonconvex  problems  is  given  by  Powell  [253].  Restart  procedures  are 
discussed  in  Powell  [248]. 

Hager  and  Zhang  [161]  report  some  of  the  best  computational  results  obtained  to  date 
with  a  nonlinear  CG  method.  Their  implementation  is  based  on  formula  (5.50)  and  uses 
a  high-accuracy  line  search  procedure.  The  results  in  Table  5.1  are  taken  from  Gilbert  and 
Nocedal  [123].  This  paper  also  describes  a  line  search  that  guarantees  that  Algorithm  PR+ 
always  generates  descent  directions  and  proves  global  convergence. 

Analysis due to Powell [245] provides further evidence of the inefficiency of the
Fletcher-Reeves method using exact line searches. He shows that if the iterates enter a
region in which the function is the two-dimensional quadratic

$$
f(x) = \tfrac{1}{2} x^T x,
$$

then the angle between the gradient $\nabla f_k$ and the search direction $p_k$ stays constant. Since
this  angle  can  be  arbitrarily  close  to  90°,  the  Fletcher-Reeves  method  can  be  slower  than 
the  steepest  descent  method.  The  Polak-Ribiere  method  behaves  quite  differently  in  these 
circumstances:  If  a  very  small  step  is  generated,  the  next  search  direction  tends  to  the  steepest 
descent  direction,  as  argued  above.  This  feature  prevents  a  sequence  of  tiny  steps. 

The  global  convergence  of  nonlinear  conjugate  gradient  methods  has  received  much 
attention;  see  for  example  Al-Baali  [3],  Gilbert  and  Nocedal  [123],  Dai  and  Yuan  [85],  and 
Hager  and  Zhang  [161].  For  recent  surveys  on  CG  methods  see  Gould  et  al.  [147]  and  Hager 
and  Zhang  [162]. 

Most of the theory on the rate of convergence of conjugate gradient methods assumes
that the line search is exact. Crowder and Wolfe [82] show that the rate of convergence
is linear, and show by constructing an example that Q-superlinear convergence is not
achievable. Powell [245] studies the case in which the conjugate gradient method enters a
region where the objective function is quadratic, and shows that either finite termination
occurs or the rate of convergence is linear. Cohen [63] and Burmeister [45] prove $n$-step
quadratic convergence (5.51) for general objective functions. Ritter [265] shows that in fact,
the rate is superquadratic, that is,

$$
\|x_{k+n} - x^*\| = o\left( \|x_k - x^*\|^2 \right).
$$

Powell [251] gives a slightly better result and performs numerical tests on small problems
to measure the rate observed in practice. He also summarizes rate-of-convergence results
for asymptotically exact line searches, such as those obtained by Baptist and Stoer [11]
and Stoer [282]. Even faster rates of convergence can be established (see Schuller [278],
Ritter [265]), under the assumption that the search directions are uniformly linearly
independent, but this assumption is hard to verify and does not often occur in practice.

Nemirovsky  and  Yudin  [225]  devote  some  attention  to  the  global  efficiency  of  the 
Fletcher-Reeves  and  Polak-Ribiere  methods  with  exact  line  searches.  For  this  purpose  they 
define  a  measure  of  “laboriousness”  and  an  “optimal  bound”  for  it  among  a  certain  class 
of  iterations.  They  show  that  on  strongly  convex  problems  not  only  do  the  Fletcher-Reeves 
and  Polak-Ribiere  methods  fail  to  attain  the  optimal  bound,  but  they  may  also  be  slower 
than  the  steepest  descent  method.  Subsequently,  Nesterov  [225]  presented  an  algorithm  that 
attains  this  optimal  bound.  It  is  related  to  PARTAN,  the  method  of  parallel  tangents  (see,  for 
example, Luenberger [195]). We feel that this approach is unlikely to be effective in practice,
but  no  conclusive  investigation  has  been  carried  out,  to  the  best  of  our  knowledge. 


EXERCISES

5.1 Implement Algorithm 5.2 and use it to solve linear systems in which $A$ is the
Hilbert matrix, whose elements are $A_{i,j} = 1/(i + j - 1)$. Set the right-hand side to
$b = (1, 1, \ldots, 1)^T$ and the initial point to $x_0 = 0$. Try dimensions $n = 5, 8, 12, 20$ and
report the number of iterations required to reduce the residual below $10^{-6}$.

5.2 Show that if the nonzero vectors $p_0, p_1, \ldots, p_l$ satisfy (5.5), where $A$ is symmetric
and positive definite, then these vectors are linearly independent. (This result implies that
$A$ has at most $n$ conjugate directions.)

5.3 Verify the formula (5.7).

5.4 Show that if $f(x)$ is a strictly convex quadratic, then the function
$h(\sigma) \stackrel{\mathrm{def}}{=} f(x_0 + \sigma_0 p_0 + \cdots + \sigma_{k-1} p_{k-1})$ also is a strictly convex quadratic in the variable
$\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})^T$.

5.5 Verify from the formulae (5.14) that (5.17) and (5.18) hold for $k = 1$.

5.6 Show that (5.24d) is equivalent to (5.14d).

5.7 Let $\{\lambda_i, v_i\}$, $i = 1, 2, \ldots, n$, be the eigenpairs of the symmetric matrix $A$. Show
that the eigenvalues and eigenvectors of $[I + P_k(A) A]^T A [I + P_k(A) A]$ are $\lambda_i [1 + \lambda_i P_k(\lambda_i)]^2$
and $v_i$, respectively.

5.8 Construct matrices with various eigenvalue distributions (clustered and
nonclustered) and apply the CG method to them. Comment on whether the behavior can be
explained from Theorem 5.5.

5.9 Derive Algorithm 5.3 by applying the standard CG method in the variables $\hat{x}$ and
then transforming back into the original variables.

5.10 Verify the modified conjugacy condition (5.40).

5.11 Show that when applied to a quadratic function, with exact line searches, both
the Polak-Ribiere formula given by (5.44) and the Hestenes-Stiefel formula given by (5.46)
reduce to the Fletcher-Reeves formula (5.41a).

5.12 Prove that Lemma 5.6 holds for any choice of $\beta_k$ satisfying $|\beta_k| \le \beta_k^{FR}$.


Chapter 


Quasi-Newton Methods


In  the  mid  1950s,  W.C.  Davidon,  a  physicist  working  at  Argonne  National  Laboratory, 
was  using  the  coordinate  descent  method  (see  Section  9.3)  to  perform  a  long  optimization 
calculation.  At  that  time  computers  were  not  very  stable,  and  to  Davidon’s  frustration, 
the  computer  system  would  always  crash  before  the  calculation  was  finished.  So  Davidon 
decided  to  find  a  way  of  accelerating  the  iteration.  The  algorithm  he  developed — the  first 
quasi-Newton  algorithm — turned  out  to  be  one  of  the  most  creative  ideas  in  nonlinear 
optimization.  It  was  soon  demonstrated  by  Fletcher  and  Powell  that  the  new  algorithm 
was much faster and more reliable than the other existing methods, and this dramatic
advance transformed nonlinear optimization overnight. During the following twenty years,
numerous  variants  were  proposed  and  hundreds  of  papers  were  devoted  to  their  study.  An 
interesting  historical  irony  is  that  Davidon’s  paper  [87]  was  not  accepted  for  publication;  it 
remained  as  a  technical  report  for  more  than  thirty  years  until  it  appeared  in  the  first  issue 
of  the  SIAM  Journal  on  Optimization  in  1991  [88]. 

Quasi-Newton methods, like steepest descent, require only the gradient of the objective
function to be supplied at each iterate. By measuring the changes in gradients, they
construct  a  model  of  the  objective  function  that  is  good  enough  to  produce  superlinear 
convergence.  The  improvement  over  steepest  descent  is  dramatic,  especially  on  difficult 
problems.  Moreover,  since  second  derivatives  are  not  required,  quasi-Newton  methods  are 
sometimes  more  efficient  than  Newton’s  method.  Today,  optimization  software  libraries 
contain  a  variety  of  quasi-Newton  algorithms  for  solving  unconstrained,  constrained,  and 
large-scale  optimization  problems.  In  this  chapter  we  discuss  quasi-Newton  methods  for 
small and medium-sized problems, and in Chapter 7 we consider their extension to the
large-scale  setting. 

The  development  of  automatic  differentiation  techniques  has  made  it  possible  to  use 
Newton’s  method  without  requiring  users  to  supply  second  derivatives;  see  Chapter  8. 
Still,  automatic  differentiation  tools  may  not  be  applicable  in  many  situations,  and  it 
may be much more costly to work with second derivatives in automatic differentiation
software than with the gradient. For these reasons, quasi-Newton methods remain
appealing. 


6.1  THE  BFGS  METHOD 


The  most  popular  quasi-Newton  algorithm  is  the  BFGS  method,  named  for  its  discoverers 
Broyden,  Fletcher,  Goldfarb,  and  Shanno.  In  this  section  we  derive  this  algorithm  (and 
its  close  relative,  the  DFP  algorithm)  and  describe  its  theoretical  properties  and  practical 
implementation. 

We begin the derivation by forming the following quadratic model of the objective
function at the current iterate $x_k$:

$$
m_k(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T B_k p. \tag{6.1}
$$

Here $B_k$ is an $n \times n$ symmetric positive definite matrix that will be revised or updated at
every iteration. Note that the function value and gradient of this model at $p = 0$ match
$f_k$ and $\nabla f_k$, respectively. The minimizer $p_k$ of this convex quadratic model, which we can
write explicitly as

$$
p_k = -B_k^{-1} \nabla f_k, \tag{6.2}
$$

is used as the search direction, and the new iterate is

$$
x_{k+1} = x_k + \alpha_k p_k, \tag{6.3}
$$

where the step length $\alpha_k$ is chosen to satisfy the Wolfe conditions (3.6). This iteration is
quite similar to the line search Newton method; the key difference is that the approximate
Hessian $B_k$ is used in place of the true Hessian.
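As a small numerical illustration of (6.2), the sketch below computes the search direction for a 2 x 2 positive definite $B_k$ by solving $B_k p = -\nabla f_k$ rather than forming the inverse, and checks that the result is a descent direction. The helper names and the data are assumptions chosen for the example; a practical code would use a general linear solver.

```python
def solve_2x2(B, r):
    """Solve B z = r for a nonsingular 2x2 matrix B by Cramer's rule."""
    (a, b), (c, d) = B
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def quasi_newton_direction(B, grad):
    """Search direction (6.2): p = -B^{-1} grad, obtained by solving B p = -grad."""
    return solve_2x2(B, [-gi for gi in grad])


B = [[2.0, 1.0], [1.0, 3.0]]   # symmetric positive definite model Hessian
grad = [1.0, -1.0]
p = quasi_newton_direction(B, grad)
slope = sum(gi * pi for gi, pi in zip(grad, p))
print(p, slope)
```

Because $B_k$ is positive definite, the directional derivative $\nabla f_k^T p_k = -\nabla f_k^T B_k^{-1} \nabla f_k$ is negative whenever the gradient is nonzero; here it is approximately $-1.4$.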

Instead of computing $B_k$ afresh at every iteration, Davidon proposed to update it in a
simple manner to account for the curvature measured during the most recent step. Suppose
that we have generated a new iterate $x_{k+1}$ and wish to construct a new quadratic model, of
the form

$$
m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^T p + \tfrac{1}{2} p^T B_{k+1} p.
$$

What requirements should we impose on $B_{k+1}$, based on the knowledge gained during
the latest step? One reasonable requirement is that the gradient of $m_{k+1}$ should match the
gradient of the objective function $f$ at the latest two iterates $x_k$ and $x_{k+1}$. Since $\nabla m_{k+1}(0)$ is
precisely $\nabla f_{k+1}$, the second of these conditions is satisfied automatically. The first condition
can be written mathematically as

$$
\nabla m_{k+1}(-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k.
$$

By rearranging, we obtain

$$
B_{k+1} \alpha_k p_k = \nabla f_{k+1} - \nabla f_k. \tag{6.4}
$$

To simplify the notation it is useful to define the vectors

$$
s_k = x_{k+1} - x_k = \alpha_k p_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k, \tag{6.5}
$$

so that (6.4) becomes

$$
B_{k+1} s_k = y_k. \tag{6.6}
$$

We refer to this formula as the secant equation.

Given the displacement $s_k$ and the change of gradients $y_k$, the secant equation
requires that the symmetric positive definite matrix $B_{k+1}$ map $s_k$ into $y_k$. This
will be possible only if $s_k$ and $y_k$ satisfy the curvature condition

$$s_k^T y_k > 0, \tag{6.7}$$

as is easily seen by premultiplying (6.6) by $s_k^T$. When $f$ is strongly convex, the
inequality (6.7) will be satisfied for any two points $x_k$ and $x_{k+1}$ (see Exercise 6.1).
However, this condition will not always hold for nonconvex functions, and in this case we
need to enforce (6.7) explicitly, by imposing restrictions on the line search procedure that
chooses the step length $\alpha$. In fact, the condition (6.7) is guaranteed to hold if we
impose the Wolfe (3.6) or strong Wolfe conditions (3.7) on the line search. To verify this
claim, we note from (6.5) and (3.6b) that $\nabla f_{k+1}^T s_k \ge c_2 \nabla f_k^T s_k$,
and therefore

$$y_k^T s_k \ge (c_2 - 1) \alpha_k \nabla f_k^T p_k. \tag{6.8}$$

Since $c_2 < 1$ and since $p_k$ is a descent direction, the term on the right is positive,
and the curvature condition (6.7) holds.
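The step from (3.6b) to (6.8) can be written out in full. The short derivation below assumes that (3.6b) is the curvature part of the Wolfe conditions, $\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k$, multiplied through by $\alpha_k > 0$:

```latex
\begin{aligned}
y_k^T s_k &= (\nabla f_{k+1} - \nabla f_k)^T s_k \\
          &= \nabla f_{k+1}^T s_k - \nabla f_k^T s_k \\
          &\ge c_2 \nabla f_k^T s_k - \nabla f_k^T s_k
             && \text{by (3.6b) with } s_k = \alpha_k p_k \\
          &= (c_2 - 1)\, \alpha_k \nabla f_k^T p_k \\
          &> 0
             && \text{since } c_2 - 1 < 0,\ \alpha_k > 0,\ \nabla f_k^T p_k < 0 .
\end{aligned}
```

The sign argument uses only that $p_k$ is a descent direction, so it holds whatever matrix produced $p_k$.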

When the curvature condition is satisfied, the secant equation (6.6) always has a
solution $B_{k+1}$. In fact, it admits an infinite number of solutions, since the
$n(n+1)/2$ degrees of freedom in a symmetric matrix exceed the $n$ conditions imposed
by the secant equation. The requirement of positive definiteness imposes $n$ additional
inequalities (all leading principal minors must be positive), but these conditions do not
absorb the remaining degrees of freedom.

To determine $B_{k+1}$ uniquely, we impose the additional condition that among all
symmetric matrices satisfying the secant equation, $B_{k+1}$ is, in some sense, closest to
the current matrix $B_k$. In other words, we solve the problem

$$\min_B \|B - B_k\| \tag{6.9a}$$

$$\text{subject to} \quad B = B^T, \quad B s_k = y_k, \tag{6.9b}$$

where $s_k$ and $y_k$ satisfy (6.7) and $B_k$ is symmetric and positive definite. Different
matrix norms can be used in (6.9a), and each norm gives rise to a different quasi-Newton
method. A norm that allows easy solution of the minimization problem (6.9) and gives rise
to a scale-invariant optimization method is the weighted Frobenius norm

$$\|A\|_W \equiv \|W^{1/2} A W^{1/2}\|_F, \tag{6.10}$$

where $\|\cdot\|_F$ is defined by $\|C\|_F^2 = \sum_{i=1}^n \sum_{j=1}^n c_{ij}^2$. The
weight matrix $W$ can be chosen as any matrix satisfying the relation $W y_k = s_k$. For
concreteness, the reader can assume that $W = \bar{G}_k^{-1}$, where $\bar{G}_k$ is the
average Hessian defined by

$$\bar{G}_k = \left[ \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k) \, d\tau \right]. \tag{6.11}$$

The property

$$y_k = \bar{G}_k \alpha_k p_k = \bar{G}_k s_k \tag{6.12}$$

follows from Taylor's theorem, Theorem 2.1. With this choice of weighting matrix $W$, the
norm (6.10) is non-dimensional, which is a desirable property, since we do not wish the
solution of (6.9) to depend on the units of the problem.

With this weighting matrix and this norm, the unique solution of (6.9) is

$$\text{(DFP)} \qquad B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \tag{6.13}$$

with

$$\rho_k = \frac{1}{y_k^T s_k}. \tag{6.14}$$

This formula is called the DFP updating formula, since it is the one originally proposed by
Davidon in 1959, and subsequently studied, implemented, and popularized by Fletcher and
Powell.

The inverse of $B_k$, which we denote by

$$H_k = B_k^{-1},$$

is useful in the implementation of the method, since it allows the search direction (6.2)
to be calculated by means of a simple matrix-vector multiplication. Using the
Sherman-Morrison-Woodbury formula (A.28), we can derive the following expression for the
update of the inverse Hessian approximation $H_k$ that corresponds to the DFP update of
$B_k$ in (6.13):

$$\text{(DFP)} \qquad H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \tag{6.15}$$

Note that the last two terms on the right-hand side of (6.15) are rank-one matrices, so that
$H_k$ undergoes a rank-two modification. It is easy to see that (6.13) is also a rank-two
modification of $B_k$. This is the fundamental idea of quasi-Newton updating: Instead of
recomputing the approximate Hessians (or inverse Hessians) from scratch at every iteration,
we apply a simple modification that combines the most recently observed information about
the objective function with the existing knowledge embedded in our current Hessian
approximation.
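As an illustration of the rank-two structure, the inverse DFP update (6.15) can be coded in a few lines. This is a minimal sketch in plain Python; the helper names (`dot`, `matvec`, `dfp_inverse_update`) and the sample vectors are assumptions chosen for the example, not part of the text.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]

def dfp_inverse_update(H, s, y):
    """Return H_{k+1} from formula (6.15); H must be symmetric."""
    ys = dot(y, s)
    if ys <= 0.0:
        raise ValueError("curvature condition y^T s > 0 is violated")
    Hy = matvec(H, y)              # H_k y_k (equals (y_k^T H_k)^T by symmetry)
    yHy = dot(y, Hy)
    n = len(s)
    # Subtract one rank-one term and add another: a rank-two modification.
    return [[H[i][j] - Hy[i] * Hy[j] / yHy + s[i] * s[j] / ys
             for j in range(n)] for i in range(n)]

# Hypothetical data with y^T s = 2.25 > 0.
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.5]
y = [2.0, 0.5]
H1 = dfp_inverse_update(H0, s, y)

# The updated matrix should satisfy the inverse secant equation H_{k+1} y_k = s_k.
residual = [a - b for a, b in zip(matvec(H1, y), s)]
print(residual)
```

The update costs $O(n^2)$ operations because only one matrix-vector product and two outer products are formed.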

The DFP updating formula is quite effective, but it was soon superseded by the BFGS
formula, which is presently considered to be the most effective of all quasi-Newton updating
formulae. BFGS updating can be derived by making a simple change in the argument
that led to (6.13). Instead of imposing conditions on the Hessian approximations $B_k$, we
impose similar conditions on their inverses $H_k$. The updated approximation $H_{k+1}$ must
be symmetric and positive definite, and must satisfy the secant equation (6.6), now written
as

$$H_{k+1} y_k = s_k.$$

The condition of closeness to $H_k$ is now specified by the following analogue of (6.9):

$$\min_H \|H - H_k\| \tag{6.16a}$$

$$\text{subject to} \quad H = H^T, \quad H y_k = s_k. \tag{6.16b}$$

The norm is again the weighted Frobenius norm described above, where the weight matrix
$W$ is now any matrix satisfying $W s_k = y_k$. (For concreteness, we assume again that $W$
is given by the average Hessian $\bar{G}_k$ defined in (6.11).) The unique solution
$H_{k+1}$ to (6.16) is given by

$$\text{(BFGS)} \qquad H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \tag{6.17}$$

with $\rho_k$ defined by (6.14).

Just one issue has to be resolved before we can define a complete BFGS algorithm: How
should we choose the initial approximation $H_0$? Unfortunately, there is no magic formula
that works well in all cases. We can use specific information about the problem, for
instance by setting it to the inverse of an approximate Hessian calculated by finite
differences at $x_0$. Otherwise, we can simply set it to be the identity matrix, or a
multiple of the identity matrix, where the multiple is chosen to reflect the scaling of the
variables.

Algorithm 6.1 (BFGS Method).

Given starting point $x_0$, convergence tolerance $\epsilon > 0$,
inverse Hessian approximation $H_0$;
$k \leftarrow 0$;
while $\|\nabla f_k\| > \epsilon$;
    Compute search direction

        $$p_k = -H_k \nabla f_k; \tag{6.18}$$

    Set $x_{k+1} = x_k + \alpha_k p_k$ where $\alpha_k$ is computed from a line search
        procedure to satisfy the Wolfe conditions (3.6);
    Define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$;
    Compute $H_{k+1}$ by means of (6.17);
    $k \leftarrow k + 1$;
end (while)

Each iteration can be performed at a cost of $O(n^2)$ arithmetic operations (plus the cost
of function and gradient evaluations); there are no $O(n^3)$ operations such as linear
system solves or matrix-matrix operations. The algorithm is robust, and its rate of
convergence is superlinear, which is fast enough for most practical purposes. Even though
Newton's method converges more rapidly (that is, quadratically), its cost per iteration
usually is higher, because of its need for second derivatives and solution of a linear
system.
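Algorithm 6.1 can be sketched in plain Python as follows. Several pieces here are assumptions made for the example rather than prescriptions of the text: the bisection/doubling routine `wolfe_step` is a simple stand-in for a full Wolfe line search, $H_0 = I$ is used without scaling, the update is skipped if the computed $y_k^T s_k$ is not safely positive (a guard for this sketch only), and the Rosenbrock function is written out by hand as the test problem.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosenbrock_grad(x):
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_trials=60):
    """Bracketing search for a step length satisfying the (weak) Wolfe conditions."""
    lo, hi, alpha = 0.0, math.inf, 1.0           # the unit step is tried first
    f0, slope0 = f(x), dot(grad(x), p)
    for _ in range(max_trials):
        trial = [xi + alpha * pi for xi, pi in zip(x, p)]
        if f(trial) > f0 + c1 * alpha * slope0:  # sufficient decrease fails
            hi = alpha
        elif dot(grad(trial), p) < c2 * slope0:  # curvature condition fails
            lo = alpha
        else:
            return alpha
        alpha = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
    return alpha

def bfgs(f, grad, x0, tol=1e-5, max_iter=1000):
    n = len(x0)
    x, g = list(x0), grad(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # H_0 = I
    for k in range(max_iter):
        if math.sqrt(dot(g, g)) <= tol:
            return x, k
        p = [-v for v in matvec(H, g)]           # search direction (6.18)
        alpha = wolfe_step(f, grad, x, p)
        s = [alpha * pi for pi in p]
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]
        g = g_new
        ys = dot(y, s)
        if ys <= 1e-10 * math.sqrt(dot(s, s) * dot(y, y)):
            continue                             # guard used only in this sketch
        rho = 1.0 / ys
        Hy = matvec(H, y)
        c = rho * (1.0 + rho * dot(y, Hy))
        # Expanded form of (6.17): H - rho (s Hy^T + Hy s^T) + c s s^T, O(n^2) work.
        H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + c * s[i] * s[j]
              for j in range(n)] for i in range(n)]
    return x, max_iter

x_min, iterations = bfgs(rosenbrock, rosenbrock_grad, [-1.2, 1.0])
print(x_min, iterations)
```

Expanding the product in (6.17) before coding it avoids any matrix-matrix multiplication, which is what keeps each iteration at $O(n^2)$ operations.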

We can derive a version of the BFGS algorithm that works with the Hessian approximation
$B_k$ rather than $H_k$. The update formula for $B_k$ is obtained by simply applying the
Sherman-Morrison-Woodbury formula (A.28) to (6.17) to obtain

$$\text{(BFGS)} \qquad B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{6.19}$$

A naive implementation of this variant is not efficient for unconstrained minimization,
because it requires the system $B_k p_k = -\nabla f_k$ to be solved for the step $p_k$,
thereby increasing the cost of the step computation to $O(n^3)$. We discuss later, however,
that less expensive implementations of this variant are possible by updating Cholesky
factors of $B_k$.


PROPERTIES  OF  THE  BFGS  METHOD 

It is usually easy to observe the superlinear rate of convergence of the BFGS method
on practical problems. Below, we report the last few iterations of the steepest descent,
BFGS, and an inexact Newton method on Rosenbrock's function (2.22). The table gives the
value of $\|x_k - x^*\|$. The Wolfe conditions were imposed on the step length in all three
methods. From the starting point $(-1.2, 1)$, the steepest descent method required 5264
iterations, whereas BFGS and Newton took only 34 and 21 iterations, respectively, to reduce
the gradient norm to $10^{-5}$.

    steepest descent    BFGS        Newton
    1.827e-04           1.70e-03    3.48e-02
    1.826e-04           1.17e-03    1.44e-02
    1.824e-04           1.34e-04    1.82e-04
    1.823e-04           1.01e-06    1.17e-08

A few points in the derivation of the BFGS and DFP methods merit further discussion.
Note that the minimization problem (6.16) that gives rise to the BFGS update formula does
not explicitly require the updated Hessian approximation to be positive definite. It is easy
to show, however, that $H_{k+1}$ will be positive definite whenever $H_k$ is positive
definite, by using the following argument. First, note from (6.8) that $y_k^T s_k$ is
positive, so that the updating formula (6.17), (6.14) is well-defined. For any nonzero
vector $z$, we have

$$z^T H_{k+1} z = w^T H_k w + \rho_k (z^T s_k)^2 \ge 0,$$

where we have defined $w = z - \rho_k y_k (s_k^T z)$. The right-hand side can be zero only
if $s_k^T z = 0$, but in this case $w = z \ne 0$, which implies that the first term is
greater than zero. Therefore, $H_{k+1}$ is positive definite.

To make quasi-Newton updating formulae invariant to transformations in the variables
(such as scaling transformations), it is necessary for the objectives (6.9a) and (6.16a)
to be invariant under the same transformations. The choice of the weighting matrices $W$
used to define the norms in (6.9a) and (6.16a) ensures that this condition holds. Many other
choices of the weighting matrix $W$ are possible, each one of them giving a different update
formula. However, despite intensive searches, no formula has been found that is
significantly more effective than BFGS.

The BFGS method has many interesting properties when applied to quadratic functions.
We discuss these properties later in the more general context of the Broyden family of
updating formulae, of which BFGS is a special case.

It is reasonable to ask whether there are situations in which an updating formula such
as (6.17) can produce bad results. If at some iteration the matrix $H_k$ becomes a poor
approximation to the true inverse Hessian, is there any hope of correcting it? For example,
when the inner product $y_k^T s_k$ is tiny (but positive), then it follows from (6.14),
(6.17) that $H_{k+1}$ contains very large elements. Is this behavior reasonable? A related
question concerns the rounding errors that occur in finite-precision implementation of these
methods. Can these errors grow to the point of erasing all useful information in the
quasi-Newton approximate Hessian?

These questions have been studied analytically and experimentally, and it is now
known that the BFGS formula has very effective self-correcting properties. If the matrix
$H_k$ incorrectly estimates the curvature in the objective function, and if this bad
estimate slows down the iteration, then the Hessian approximation will tend to correct
itself within a few steps. It is also known that the DFP method is less effective in
correcting bad Hessian approximations; this property is believed to be the reason for its
poorer practical performance. The self-correcting properties of BFGS hold only when an
adequate line search is performed. In particular, the Wolfe line search conditions ensure
that the gradients are sampled at points that allow the model (6.1) to capture appropriate
curvature information.

It is interesting to note that the DFP and BFGS updating formulae are duals of each
other, in the sense that one can be obtained from the other by the interchanges
$s \leftrightarrow y$, $B \leftrightarrow H$. This symmetry is not surprising, given the
manner in which we derived these methods above.

IMPLEMENTATION 

A few details and enhancements need to be added to Algorithm 6.1 to produce an
efficient implementation. The line search, which should satisfy either the Wolfe conditions
(3.6) or the strong Wolfe conditions (3.7), should always try the step length
$\alpha_k = 1$ first, because this step length will eventually always be accepted (under
certain conditions), thereby producing superlinear convergence of the overall algorithm.
Computational observations strongly suggest that it is more economical, in terms of function
evaluations, to perform a fairly inaccurate line search. The values $c_1 = 10^{-4}$ and
$c_2 = 0.9$ are commonly used in (3.6).

As mentioned earlier, the initial matrix $H_0$ often is set to some multiple $\beta I$ of
the identity, but there is no good general strategy for choosing the multiple $\beta$. If
$\beta$ is too large, so that the first step $p_0 = -\beta g_0$ (with $g_0 = \nabla f_0$) is
too long, many function evaluations may be required to find a suitable value for the step
length $\alpha_0$. Some software asks the user to prescribe a value $\delta$ for the norm of
the first step, and then sets $H_0 = \delta \|g_0\|^{-1} I$ to achieve this norm.

A heuristic that is often quite effective is to scale the starting matrix after the first
step has been computed but before the first BFGS update is performed. We change the
provisional value $H_0 = I$ by setting

$$H_0 \leftarrow \frac{y_k^T s_k}{y_k^T y_k} I, \tag{6.20}$$

before applying the update (6.14), (6.17) to obtain $H_1$. This formula attempts to make the
size of $H_0$ similar to that of $\nabla^2 f(x_0)^{-1}$, in the following sense. Assuming
that the average Hessian defined in (6.11) is positive definite, there exists a square root
$\bar{G}_k^{1/2}$ satisfying $\bar{G}_k = \bar{G}_k^{1/2} \bar{G}_k^{1/2}$ (see Exercise
6.6). Therefore, by defining $z_k = \bar{G}_k^{1/2} s_k$ and using the relation (6.12), we
have

$$\frac{y_k^T s_k}{y_k^T y_k} = \frac{(\bar{G}_k^{1/2} s_k)^T \bar{G}_k^{1/2} s_k}{(\bar{G}_k^{1/2} s_k)^T \bar{G}_k \bar{G}_k^{1/2} s_k} = \frac{z_k^T z_k}{z_k^T \bar{G}_k z_k}. \tag{6.21}$$

The reciprocal of (6.21) is an approximation to one of the eigenvalues of $\bar{G}_k$, which
in turn is close to an eigenvalue of $\nabla^2 f(x_k)$. Hence, the quotient (6.21) itself
approximates an eigenvalue of $\nabla^2 f(x_k)^{-1}$. Other scaling factors can be used in
(6.20), but the one presented here appears to be the most successful in practice.
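The scaling factor in (6.20) is cheap to compute, and a small experiment shows why it lands between the extreme eigenvalues of the inverse Hessian. The sketch below assumes a hypothetical quadratic with Hessian $A = \mathrm{diag}(10, 1000)$, for which $y = As$ holds exactly; the function name `initial_scaling` is chosen for this example only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def initial_scaling(s, y):
    """Return the scalar y^T s / y^T y that multiplies the identity in (6.20)."""
    return dot(y, s) / dot(y, y)

# Hypothetical quadratic with Hessian A = diag(10, 1000), so that y = A s.
a_diag = [10.0, 1000.0]
s = [1.0, 1.0]
y = [a * si for a, si in zip(a_diag, s)]

gamma = initial_scaling(s, y)
H0 = [[gamma if i == j else 0.0 for j in range(2)] for i in range(2)]

# gamma is a reciprocal Rayleigh quotient, as in (6.21), so it must lie
# between 1/1000 and 1/10, the eigenvalues of the inverse Hessian.
print(gamma)
```

In this example the factor is close to $10^{-3}$, because the step has a large component along the direction of strongest curvature.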

In (6.19) we gave an update formula for a BFGS method that works with the Hessian
approximation $B_k$ instead of the inverse Hessian approximation $H_k$. An efficient
implementation of this approach does not store $B_k$ explicitly, but rather the Cholesky
factorization $L_k D_k L_k^T$ of this matrix. A formula that updates the factors $L_k$ and
$D_k$ directly in $O(n^2)$ operations can be derived from (6.19). Since the linear system
$B_k p_k = -\nabla f_k$ also can be solved in $O(n^2)$ operations (by performing triangular
substitutions with $L_k$ and $L_k^T$ and a diagonal substitution with $D_k$), the total cost
is quite similar to the variant described in Algorithm 6.1. A potential advantage of this
alternative strategy is that it gives us the option of modifying diagonal elements in the
$D_k$ factor if they are not sufficiently large, to prevent instability when we divide by
these elements during the calculation of $p_k$. However, computational experience suggests
no real advantages for this variant, and we prefer the simpler strategy of Algorithm 6.1.

The performance of the BFGS method can degrade if the line search is not based
on the Wolfe conditions. For example, some software implements an Armijo backtracking
line search (see Section 3.1): The unit step length $\alpha_k = 1$ is tried first and is
successively decreased until the sufficient decrease condition (3.6a) is satisfied. For this
strategy, there is no guarantee that the curvature condition $y_k^T s_k > 0$ (6.7) will be
satisfied by the chosen step, since a step length greater than 1 may be required to satisfy
this condition. To cope with this shortcoming, some implementations simply skip the BFGS
update by setting $H_{k+1} = H_k$ when $y_k^T s_k$ is negative or too close to zero. This
approach is not recommended, because the updates may be skipped much too often to allow
$H_k$ to capture important curvature information for the objective function $f$. In Chapter
18 we discuss a damped BFGS update that is a more effective strategy for coping with the
case where the curvature condition (6.7) is not satisfied.
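A one-dimensional example makes the failure concrete. The sketch below uses a hypothetical nonconvex function $f(x) = x^4/4 - x^2/2$ and a plain Armijo backtracking loop (both are assumptions chosen for illustration); the unit step is accepted immediately, yet the product $y_k s_k$ is negative.

```python
def f(x):
    return 0.25 * x ** 4 - 0.5 * x ** 2      # nonconvex near x = 0

def df(x):
    return x ** 3 - x

def armijo_backtracking(x, p, c1=1e-4, shrink=0.5):
    """Start from alpha = 1 and halve it until sufficient decrease (3.6a) holds."""
    alpha = 1.0
    while f(x + alpha * p) > f(x) + c1 * alpha * df(x) * p:
        alpha *= shrink
    return alpha

x = 0.1
p = -df(x)                  # descent direction obtained with H_k = 1
alpha = armijo_backtracking(x, p)
s = alpha * p
y = df(x + s) - df(x)

# Sufficient decrease holds at alpha = 1, but the curvature product is negative,
# so a BFGS update with this (s, y) pair would not be well defined.
print(alpha, s * y)
```

A Wolfe line search would reject this step and look for a longer one, moving past the region of negative curvature.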


6.2 THE SR1 METHOD


In the BFGS and DFP updating formulae, the updated matrix $B_{k+1}$ (or $H_{k+1}$) differs
from its predecessor $B_k$ (or $H_k$) by a rank-2 matrix. In fact, as we now show, there is
a simpler rank-1 update that maintains symmetry of the matrix and allows it to satisfy the
secant equation. Unlike the rank-two update formulae, this symmetric-rank-1, or SR1, update
does not guarantee that the updated matrix maintains positive definiteness. Good numerical
results have been obtained with algorithms based on SR1, so we derive it here and
investigate its properties.

The symmetric rank-1 update has the general form

$$B_{k+1} = B_k + \sigma v v^T,$$

where $\sigma$ is either $+1$ or $-1$, and $\sigma$ and $v$ are chosen so that $B_{k+1}$
satisfies the secant equation (6.6), that is, $y_k = B_{k+1} s_k$. By substituting into this
equation, we obtain

$$y_k = B_k s_k + \left[ \sigma v^T s_k \right] v. \tag{6.22}$$

Since the term in brackets is a scalar, we deduce that $v$ must be a multiple of
$y_k - B_k s_k$, that is, $v = \delta (y_k - B_k s_k)$ for some scalar $\delta$. By
substituting this form of $v$ into (6.22), we obtain

$$(y_k - B_k s_k) = \sigma \delta^2 \left[ s_k^T (y_k - B_k s_k) \right] (y_k - B_k s_k), \tag{6.23}$$

and it is clear that this equation is satisfied if (and only if) we choose the parameters
$\delta$ and $\sigma$ to be

$$\sigma = \operatorname{sign} \left[ s_k^T (y_k - B_k s_k) \right], \qquad \delta = \pm \left| s_k^T (y_k - B_k s_k) \right|^{-1/2}.$$

Hence, we have shown that the only symmetric rank-1 updating formula that satisfies the
secant equation is given by

$$\text{(SR1)} \qquad B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}. \tag{6.24}$$


By applying the Sherman-Morrison formula (A.27), we obtain the corresponding update
formula for the inverse Hessian approximation $H_k$:

$$\text{(SR1)} \qquad H_{k+1} = H_k + \frac{(s_k - H_k y_k)(s_k - H_k y_k)^T}{(s_k - H_k y_k)^T y_k}. \tag{6.25}$$

This derivation is so simple that the SR1 formula has been rediscovered a number of times.

It is easy to see that even if $B_k$ is positive definite, $B_{k+1}$ may not have the same
property. (The same is, of course, true of $H_k$.) This observation was considered a major
drawback in the early days of nonlinear optimization when only line search iterations were
used. However, with the advent of trust-region methods, the SR1 updating formula has proved
to be quite useful, and its ability to generate indefinite Hessian approximations can
actually be regarded as one of its chief advantages.

The main drawback of SR1 updating is that the denominator in (6.24) or (6.25) can
vanish. In fact, even when the objective function is a convex quadratic, there may be steps
on which there is no symmetric rank-1 update that satisfies the secant equation. It pays to
reexamine the derivation above in the light of this observation.

By reasoning in terms of $B_k$ (similar arguments can be applied to $H_k$), we see that
there are three cases:

1. If $(y_k - B_k s_k)^T s_k \ne 0$, then the arguments above show that there is a unique
rank-one updating formula satisfying the secant equation (6.6), and that it is given
by (6.24).

2. If $y_k = B_k s_k$, then the only updating formula satisfying the secant equation is
simply $B_{k+1} = B_k$.

3. If $y_k \ne B_k s_k$ and $(y_k - B_k s_k)^T s_k = 0$, then (6.23) shows that there is no
symmetric rank-one updating formula satisfying the secant equation.

The last case clouds an otherwise simple and elegant derivation, and suggests that numerical
instabilities and even breakdown of the method can occur. It suggests that rank-one updating
does not provide enough freedom to develop a matrix with all the desired characteristics,
and that a rank-two correction is required. This reasoning leads us back to the BFGS method,
in which positive definiteness (and thus nonsingularity) of all Hessian approximations is
guaranteed.

Nevertheless, we are interested in the SR1 formula for the following reasons.

(i) A simple safeguard seems to adequately prevent the breakdown of the method and the
occurrence of numerical instabilities.

(ii) The matrices generated by the SR1 formula tend to be good approximations to the
true Hessian matrix, often better than the BFGS approximations.

(iii) In quasi-Newton methods for constrained problems, or in methods for partially
separable functions (see Chapters 18 and 7), it may not be possible to impose the
curvature condition $y_k^T s_k > 0$, and thus BFGS updating is not recommended. Indeed,
in these two settings, indefinite Hessian approximations are desirable insofar as they
reflect indefiniteness in the true Hessian.

We now introduce a strategy to prevent the SR1 method from breaking down. It
has been observed in practice that SR1 performs well simply by skipping the update if the
denominator is small. More specifically, the update (6.24) is applied only if

$$\left| s_k^T (y_k - B_k s_k) \right| \ge r \|s_k\| \, \|y_k - B_k s_k\|, \tag{6.26}$$

where $r \in (0, 1)$ is a small number, say $r = 10^{-8}$. If (6.26) does not hold, we set
$B_{k+1} = B_k$. Most implementations of the SR1 method use a skipping rule of this
kind.

Why do we advocate skipping of updates for the SR1 method, when in the previous
section we discouraged this strategy in the case of BFGS? The two cases are quite different.
The condition $s_k^T (y_k - B_k s_k) \approx 0$ occurs infrequently, since it requires
certain vectors to be aligned in a specific way. When it does occur, skipping the update
appears to have no negative effects on the iteration. This is not surprising, since the
skipping condition implies that $s_k^T \bar{G} s_k \approx s_k^T B_k s_k$, where $\bar{G}$
is the average Hessian over the last step, meaning that the curvature of $B_k$ along $s_k$
is already correct. In contrast, the curvature condition $s_k^T y_k > 0$ required for BFGS
updating may easily fail if the line search does not impose the Wolfe conditions (for
example, if the step is not long enough), and therefore skipping the BFGS update can occur
often and can degrade the quality of the Hessian approximation.
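The update (6.24) together with the skipping test (6.26) fits in one short function. The following plain-Python sketch uses names (`sr1_update`, `applied`) and sample data that are assumptions for the example; the extra check for an exactly zero denominator covers the case $y_k = B_k s_k$, where both sides of (6.26) vanish.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sr1_update(B, s, y, r=1e-8):
    """Apply (6.24) when test (6.26) holds; otherwise return a copy of B."""
    w = [yi - bi for yi, bi in zip(y, matvec(B, s))]      # y_k - B_k s_k
    denom = dot(w, s)
    threshold = r * math.sqrt(dot(s, s)) * math.sqrt(dot(w, w))
    if denom == 0.0 or abs(denom) < threshold:
        return [row[:] for row in B], False               # skip the update
    n = len(s)
    B_new = [[B[i][j] + w[i] * w[j] / denom for j in range(n)] for i in range(n)]
    return B_new, True

identity = [[1.0, 0.0], [0.0, 1.0]]

# Negative curvature along s: the update is applied and B becomes indefinite.
B1, applied1 = sr1_update(identity, [1.0, 0.0], [-1.0, 0.0])

# y - B s is orthogonal to s: the denominator vanishes and the update is skipped.
B2, applied2 = sr1_update(identity, [1.0, 0.0], [1.0, 1.0])

print(B1, applied1)
print(B2, applied2)
```

The first call shows the property discussed above: starting from a positive definite matrix, the SR1 update can return an indefinite one while still satisfying the secant equation.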

We now give a formal description of an SR1 method using a trust-region framework,
which we prefer over a line search framework because it can accommodate indefinite Hessian
approximations more easily.

Algorithm 6.2 (SR1 Trust-Region Method).

Given starting point $x_0$, initial Hessian approximation $B_0$,
trust-region radius $\Delta_0$, convergence tolerance $\epsilon > 0$,
parameters $\eta \in (0, 10^{-3})$ and $r \in (0, 1)$;
$k \leftarrow 0$;
while $\|\nabla f_k\| > \epsilon$;
    Compute $s_k$ by solving the subproblem

        $$\min_s \; \nabla f_k^T s + \tfrac{1}{2} s^T B_k s \quad \text{subject to } \|s\| \le \Delta_k; \tag{6.27}$$

    Compute
        $y_k = \nabla f(x_k + s_k) - \nabla f_k$,
        ared $= f_k - f(x_k + s_k)$    (actual reduction)
        pred $= -\left( \nabla f_k^T s_k + \tfrac{1}{2} s_k^T B_k s_k \right)$    (predicted reduction);
    if ared/pred $> \eta$
        $x_{k+1} = x_k + s_k$;
    else
        $x_{k+1} = x_k$;
    end (if)
    if ared/pred $> 0.75$
        if $\|s_k\| \le 0.8 \Delta_k$
            $\Delta_{k+1} = \Delta_k$;
        else
            $\Delta_{k+1} = 2 \Delta_k$;
        end (if)
    else if $0.1 \le$ ared/pred $\le 0.75$
        $\Delta_{k+1} = \Delta_k$;
    else
        $\Delta_{k+1} = 0.5 \Delta_k$;
    end (if)
    if (6.26) holds
        Use (6.24) to compute $B_{k+1}$ (even if $x_{k+1} = x_k$);
    else
        $B_{k+1} \leftarrow B_k$;
    end (if)
    $k \leftarrow k + 1$;
end (while)

This  algorithm  has  the  typical  form  of  a  trust  region  method  (cf.  Algorithm  4.1).  For 
concreteness,  we  have  specified  a  particular  strategy  for  updating  the  trust  region  radius, 
but  other  heuristics  can  be  used  instead. 
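The acceptance test and the radius rule of Algorithm 6.2 are independent of how the subproblem (6.27) is solved, so they can be isolated in two small functions. This is a sketch under stated assumptions: the function names are invented for the example, the comparisons with 0.8, 0.1, and 0.75 are read as non-strict, and $\eta = 10^{-4}$ is one admissible choice from the interval $(0, 10^{-3})$.

```python
def accept_step(ratio, eta=1e-4):
    """Accept the trial step when ared/pred exceeds eta."""
    return ratio > eta

def update_radius(ratio, step_norm, delta):
    """Trust-region radius rule of Algorithm 6.2; ratio is ared/pred."""
    if ratio > 0.75:
        # Good agreement between model and function: enlarge the region
        # only if the step came close to the boundary.
        return delta if step_norm <= 0.8 * delta else 2.0 * delta
    if 0.1 <= ratio <= 0.75:
        return delta            # acceptable agreement: keep the radius
    return 0.5 * delta          # poor agreement: shrink the region

# A step on the boundary with very good agreement doubles the radius.
print(update_radius(0.9, 1.0, 1.0))
# A rejected step (negative actual reduction) halves it.
print(accept_step(-0.2), update_radius(-0.2, 1.0, 1.0))
```

Keeping these rules separate from the subproblem solver makes it easy to experiment with the other radius heuristics mentioned above.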

To obtain a fast rate of convergence, it is important for the matrix to be updated
even along a failed direction $s_k$. The fact that the step was poor indicates that $B_k$
is an inadequate approximation of the true Hessian in this direction. Unless the quality of
the approximation is improved, steps along similar directions could be generated on later
iterations, and repeated rejection of such steps could prevent superlinear convergence.

PROPERTIES OF SR1 UPDATING

One of the main advantages of SR1 updating is its ability to generate good Hessian
approximations. We demonstrate this property by first examining a quadratic function. For
functions of this type, the choice of step length does not affect the update, so to examine
the effect of the updates, we can assume for simplicity a uniform step length of 1, that is,

$$p_k = -H_k \nabla f_k, \qquad x_{k+1} = x_k + p_k. \tag{6.28}$$

It follows that $p_k = s_k$.

Theorem 6.1.

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is the strongly convex quadratic function
$f(x) = b^T x + \tfrac{1}{2} x^T A x$, where $A$ is symmetric positive definite. Then for
any starting point $x_0$ and any symmetric starting matrix $H_0$, the iterates $\{x_k\}$
generated by the SR1 method (6.25), (6.28) converge to the minimizer in at most $n$ steps,
provided that $(s_k - H_k y_k)^T y_k \ne 0$ for all $k$. Moreover, if $n$ steps are
performed, and if the search directions $p_i$ are linearly independent, then
$H_n = A^{-1}$.

PROOF. Because of our assumption $(s_k - H_k y_k)^T y_k \ne 0$, the SR1 update is always
well-defined. We start by showing inductively that

$$H_k y_j = s_j, \qquad j = 0, 1, \ldots, k-1. \tag{6.29}$$

In other words, we claim that the secant equation is satisfied not only along the most
recent search direction, but along all previous directions.

By definition, the SR1 update satisfies the secant equation, so we have $H_1 y_0 = s_0$.
Let us now assume that (6.29) holds for some value $k \ge 1$ and show that it holds also for
$k + 1$. From this assumption, we have from (6.29) that

$$(s_k - H_k y_k)^T y_j = s_k^T y_j - y_k^T (H_k y_j) = s_k^T y_j - y_k^T s_j = 0, \qquad \text{all } j < k, \tag{6.30}$$

where the last equality follows because $y_i = A s_i$ for the quadratic function we are
considering here. By using (6.30) and the induction hypothesis (6.29) in (6.25), we have

$$H_{k+1} y_j = H_k y_j = s_j, \qquad \text{for all } j < k.$$

Since $H_{k+1} y_k = s_k$ by the secant equation, we have shown that (6.29) holds when $k$
is replaced by $k + 1$. By induction, then, this relation holds for all $k$.

If the algorithm performs $n$ steps and if these steps $\{s_j\}$ are linearly independent,
we have

$$s_j = H_n y_j = H_n A s_j, \qquad j = 0, 1, \ldots, n-1.$$

It follows that $H_n A = I$, that is, $H_n = A^{-1}$. Therefore, the step taken at $x_n$ is
the Newton step, and so the next iterate $x_{n+1}$ will be the solution, and the algorithm
terminates.

Consider now the case in which the steps become linearly dependent. Suppose that $s_k$
is a linear combination of the previous steps, that is,

$$s_k = \xi_0 s_0 + \cdots + \xi_{k-1} s_{k-1}, \tag{6.31}$$

for some scalars $\xi_i$. From (6.31) and (6.29) we have that

$$\begin{aligned}
H_k y_k &= H_k A s_k \\
        &= \xi_0 H_k A s_0 + \cdots + \xi_{k-1} H_k A s_{k-1} \\
        &= \xi_0 H_k y_0 + \cdots + \xi_{k-1} H_k y_{k-1} \\
        &= \xi_0 s_0 + \cdots + \xi_{k-1} s_{k-1} \\
        &= s_k.
\end{aligned}$$

Since $y_k = \nabla f_{k+1} - \nabla f_k$ and since $s_k = p_k = -H_k \nabla f_k$ from
(6.28), we have that

$$H_k (\nabla f_{k+1} - \nabla f_k) = -H_k \nabla f_k,$$

which, by the nonsingularity of $H_k$, implies that $\nabla f_{k+1} = 0$. Therefore,
$x_{k+1}$ is the solution point.  □
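Theorem 6.1 can be checked numerically on a small example. The sketch below applies the inverse SR1 update (6.25) with unit steps (6.28) to a hypothetical two-dimensional quadratic (the matrix $A$, the vector $b$, and the starting data are assumptions chosen so that the denominators are nonzero and the steps are linearly independent). After $n = 2$ updates, $H$ agrees with $A^{-1}$ up to rounding.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sr1_inverse_update(H, s, y):
    """Formula (6.25); assumes the denominator (s - H y)^T y is nonzero."""
    w = [si - hi for si, hi in zip(s, matvec(H, y))]       # s_k - H_k y_k
    denom = dot(w, y)
    n = len(s)
    return [[H[i][j] + w[i] * w[j] / denom for j in range(n)] for i in range(n)]

A = [[4.0, 1.0], [1.0, 3.0]]       # symmetric positive definite
b = [-1.0, -2.0]

def grad(x):
    """Gradient of f(x) = b^T x + 0.5 x^T A x."""
    return [ax + bi for ax, bi in zip(matvec(A, x), b)]

x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]       # symmetric starting matrix H_0
for k in range(2):                 # n = 2 steps of unit length, as in (6.28)
    g = grad(x)
    s = [-v for v in matvec(H, g)]
    x_next = [xi + si for xi, si in zip(x, s)]
    y = [a - c for a, c in zip(grad(x_next), g)]
    H = sr1_inverse_update(H, s, y)
    x = x_next

A_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
# With H_2 = A^{-1}, the next step is the Newton step and reaches the minimizer.
x_star = [xi - v for xi, v in zip(x, matvec(H, grad(x)))]
print(H)
print(x_star)
```

No line search appears anywhere in the loop, which reflects the remark in the proof that (6.29) holds along all previous directions regardless of step length.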

The relation (6.29) shows that when $f$ is quadratic, the secant equation is satisfied
along all previous search directions, regardless of how the line search is performed. A
result like this can be established for BFGS updating only under the restrictive assumption
that the line search is exact, as we show in the next section.

For general nonlinear functions, the SR1 update continues to generate good Hessian
approximations under certain conditions.

Theorem 6.2.

Suppose that $f$ is twice continuously differentiable, and that its Hessian is bounded and
Lipschitz continuous in a neighborhood of a point $x^*$. Let $\{x_k\}$ be any sequence of
iterates such that $x_k \to x^*$ for some $x^* \in \mathbb{R}^n$. Suppose in addition that
the inequality (6.26) holds for all $k$, for some $r \in (0, 1)$, and that the steps $s_k$
are uniformly linearly independent. Then the matrices $B_k$ generated by the SR1 updating
formula satisfy

$$\lim_{k \to \infty} \|B_k - \nabla^2 f(x^*)\| = 0.$$

The term "uniformly linearly independent steps" means, roughly speaking, that the
steps do not tend to fall in a subspace of dimension less than $n$. This assumption is
usually, but not always, satisfied in practice (see the Notes and References at the end of
this chapter).


6.3  THE  BROYDEN  CLASS 


So far, we have described the BFGS, DFP, and SR1 quasi-Newton updating formulae, but
there are many others. Of particular interest is the Broyden class, a family of updates specified
by the following general formula:

\[ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}
   + \phi_k (s_k^T B_k s_k)\, v_k v_k^T, \tag{6.32} \]

where $\phi_k$ is a scalar parameter and

\[ v_k = \left[ \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k} \right]. \tag{6.33} \]

The BFGS and DFP methods are members of the Broyden class: we recover BFGS by setting
$\phi_k = 0$ and DFP by setting $\phi_k = 1$ in (6.32). We can therefore rewrite (6.32) as a "linear
combination" of these two methods, that is,

\[ B_{k+1} = (1 - \phi_k) B_{k+1}^{\rm BFGS} + \phi_k B_{k+1}^{\rm DFP}. \]

This relationship indicates that all members of the Broyden class satisfy the secant equation
(6.6), since the BFGS and DFP matrices themselves satisfy this equation. Also, since BFGS and
DFP updating preserve positive definiteness of the Hessian approximations when $s_k^T y_k > 0$,
this relation implies that the same property will hold for the Broyden family if $0 \le \phi_k \le 1$.
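
As an illustration of formula (6.32), here is a minimal sketch of a Broyden-class update in plain Python. The function name `broyden_update` and the test data are assumptions made for this example. The sketch lets one check two claims from the paragraph above: the secant equation $B_{k+1} s_k = y_k$ holds for any $\phi_k$, and the update for an intermediate $\phi_k$ is the corresponding combination of the $\phi_k = 0$ (BFGS) and $\phi_k = 1$ (DFP) updates.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def broyden_update(B, s, y, phi):
    """Broyden-class update (6.32) of a symmetric matrix B with parameter phi."""
    Bs = matvec(B, s)
    sBs = dot(s, Bs)   # s^T B s
    ys = dot(y, s)     # y^T s, assumed positive
    v = [yi / ys - bi / sBs for yi, bi in zip(y, Bs)]   # the vector v_k of (6.33)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys + phi * sBs * v[i] * v[j]
             for j in range(n)] for i in range(n)]


# Illustrative data (hypothetical): B is positive definite and y^T s = 4 > 0.
B = [[2.0, 0.0], [0.0, 1.0]]
s = [1.0, 1.0]
y = [3.0, 1.0]

B_bfgs = broyden_update(B, s, y, 0.0)    # phi = 0 gives BFGS
B_dfp = broyden_update(B, s, y, 1.0)     # phi = 1 gives DFP
B_mid = broyden_update(B, s, y, 0.25)    # a member of the restricted Broyden class
```

Because $v_k^T s_k = 0$, the last term of (6.32) never disturbs the secant equation, which is why the check succeeds for every value of `phi`, including negative ones.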

Much attention has been given to the so-called restricted Broyden class, which is
obtained by restricting $\phi_k$ to the interval $[0, 1]$. It enjoys the following property when
applied to quadratic functions. Since the analysis is independent of the step length, we
assume for simplicity that each iteration has the form

\[ p_k = -B_k^{-1} \nabla f_k, \qquad x_{k+1} = x_k + p_k. \tag{6.34} \]


Theorem  6.3. 

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is the strongly convex quadratic function $f(x) = b^T x +
\frac12 x^T A x$, where $A$ is symmetric and positive definite. Let $x_0$ be any starting point for the iteration
(6.34) and $B_0$ be any symmetric positive definite starting matrix, and suppose that the matrices
$B_k$ are updated by the Broyden formula (6.32) with $\phi_k \in [0, 1]$. Define $\lambda_1^k \le \lambda_2^k \le \cdots \le \lambda_n^k$
to be the eigenvalues of the matrix

\[ A^{1/2} B_k^{-1} A^{1/2}. \tag{6.35} \]

Then for all $k$, we have

\[ \min\{\lambda_i^k, 1\} \le \lambda_i^{k+1} \le \max\{\lambda_i^k, 1\}, \quad i = 1, 2, \ldots, n. \tag{6.36} \]

Moreover, the property (6.36) does not hold if the Broyden parameter $\phi_k$ is chosen outside the
interval $[0, 1]$.

Let us discuss the significance of this result. If the eigenvalues $\lambda_i^k$ of the matrix (6.35)
are all 1, then the quasi-Newton approximation $B_k$ is identical to the Hessian $A$ of the
quadratic objective function. This situation is the ideal one, so we should be hoping for
these eigenvalues to be as close to 1 as possible. In fact, relation (6.36) tells us that the
eigenvalues $\{\lambda_i^k\}$ converge monotonically (but not strictly monotonically) to 1. Suppose, for
example, that at iteration $k$ the smallest eigenvalue is $\lambda_1^k = 0.7$. Then (6.36) tells us that
at the next iteration $\lambda_1^{k+1} \in [0.7, 1]$. We cannot be sure that this eigenvalue has actually
moved closer to 1, but it is reasonable to expect that it has. In contrast, the first eigenvalue
can become smaller than 0.7 if we allow $\phi_k$ to be outside $[0, 1]$. Significantly, the result of
Theorem 6.3 holds even if the line searches are not exact.
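
The eigenvalue property (6.36) can be observed on a small example. The sketch below is an illustration under stated assumptions, not part of the original analysis: it is restricted to $n = 2$ so that the eigenvalues of $B_k^{-1} A$ (which coincide with those of (6.35), since the two matrices are similar) can be obtained from the trace and determinant; the data `A`, `b` and all function names are hypothetical.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def broyden_update(B, s, y, phi):
    """Broyden-class update (6.32) of a symmetric matrix B."""
    Bs = matvec(B, s)
    sBs, ys = dot(s, Bs), dot(y, s)
    v = [yi / ys - bi / sBs for yi, bi in zip(y, Bs)]
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys + phi * sBs * v[i] * v[j]
             for j in range(n)] for i in range(n)]


def eigs_of_Binv_A(B, A):
    """Sorted eigenvalues of the 2x2 matrix B^{-1} A, from its trace and determinant."""
    tr = (B[1][1] * A[0][0] - B[0][1] * A[1][0]
          - B[1][0] * A[0][1] + B[0][0] * A[1][1]) / det2(B)
    det = det2(A) / det2(B)
    root = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return [(tr - root) / 2.0, (tr + root) / 2.0]


def eigenvalue_history(A, b, x0, phi, num_steps=4):
    """Run iteration (6.34) with B_0 = I and record the eigenvalues after each update."""
    B = [[1.0, 0.0], [0.0, 1.0]]
    x = list(x0)
    g = [ax + bi for ax, bi in zip(matvec(A, x), b)]
    history = [eigs_of_Binv_A(B, A)]
    for _ in range(num_steps):
        if math.sqrt(dot(g, g)) < 1e-8:   # already converged; stop updating
            break
        d = det2(B)
        p = [(-B[1][1] * g[0] + B[0][1] * g[1]) / d,   # p = -B^{-1} g by Cramer's rule
             (-B[0][0] * g[1] + B[1][0] * g[0]) / d]
        x = [xi + pi for xi, pi in zip(x, p)]
        g_new = [ax + bi for ax, bi in zip(matvec(A, x), b)]
        y = [gn - go for gn, go in zip(g_new, g)]
        B = broyden_update(B, p, y, phi)
        g = g_new
        history.append(eigs_of_Binv_A(B, A))
    return history


def satisfies_6_36(history, tol=1e-6):
    """Check min{lam_i^k, 1} <= lam_i^{k+1} <= max{lam_i^k, 1} for consecutive entries."""
    return all(min(old, 1.0) - tol <= new <= max(old, 1.0) + tol
               for prev, cur in zip(history, history[1:])
               for old, new in zip(prev, cur))


A = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]
results = {phi: eigenvalue_history(A, b, [0.0, 0.0], phi) for phi in (0.0, 0.5, 1.0)}
```

With $B_0 = I$ the initial eigenvalues are those of $A$, both larger than 1, so for each `phi` in $[0, 1]$ the recorded eigenvalues should move down toward 1 without ever dropping below it.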

Although Theorem 6.3 seems to suggest that the best update formulas belong to the
restricted Broyden class, the situation is not at all clear. Some analysis and computational
testing suggest that algorithms that allow $\phi_k$ to be negative (in a strictly controlled manner)
may in fact be superior to the BFGS method. The SR1 formula is a case in point: It is a
member of the Broyden class, obtained by setting

\[ \phi_k = \frac{s_k^T y_k}{s_k^T y_k - s_k^T B_k s_k}, \]

but it does not belong to the restricted Broyden class, because this value of $\phi_k$ may fall
outside the interval $[0, 1]$.

In the remaining discussion of this section, we determine more precisely the range of
values of $\phi_k$ that preserve positive definiteness.

The last term in (6.32) is a rank-one correction, which by the interlacing eigenvalue
theorem (Theorem A.1) increases the eigenvalues of the matrix when $\phi_k$ is positive. Therefore
$B_{k+1}$ is positive definite for all $\phi_k \ge 0$. On the other hand, by Theorem A.1 the last term in
(6.32) decreases the eigenvalues of the matrix when $\phi_k$ is negative. As we decrease $\phi_k$, this
matrix eventually becomes singular and then indefinite. A little computation shows that
$B_{k+1}$ is singular when $\phi_k$ has the value

\[ \phi_k^c = \frac{1}{1 - \mu_k}, \tag{6.37} \]

where

\[ \mu_k = \frac{(y_k^T B_k^{-1} y_k)(s_k^T B_k s_k)}{(y_k^T s_k)^2}. \tag{6.38} \]

By applying the Cauchy-Schwarz inequality (A.5) to (6.38), we see that $\mu_k \ge 1$ and therefore
$\phi_k^c \le 0$. Hence, if the initial Hessian approximation $B_0$ is symmetric and positive definite,
and if $s_k^T y_k > 0$ and $\phi_k > \phi_k^c$ for each $k$, then all the matrices $B_k$ generated by Broyden's
formula (6.32) remain symmetric and positive definite.
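
The critical value (6.37) can be computed directly for a small case. The following sketch is illustrative only: the names `critical_phi`, `det2`, and `broyden_update` and the data are assumptions, and the explicit $2 \times 2$ inverse limits it to $n = 2$. It evaluates $\mu_k$ from (6.38), forms $\phi_k^c$, and then looks at the determinant of the updated matrix just above, at, and just below the critical value.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def broyden_update(B, s, y, phi):
    """Broyden-class update (6.32) of a symmetric matrix B."""
    Bs = matvec(B, s)
    sBs, ys = dot(s, Bs), dot(y, s)
    v = [yi / ys - bi / sBs for yi, bi in zip(y, Bs)]
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys + phi * sBs * v[i] * v[j]
             for j in range(n)] for i in range(n)]


def critical_phi(B, s, y):
    """phi^c = 1 / (1 - mu), with mu from (6.38); undefined when mu == 1 (y parallel to B s)."""
    d = det2(B)
    Binv_y = [(B[1][1] * y[0] - B[0][1] * y[1]) / d,    # B^{-1} y via the 2x2 inverse
              (B[0][0] * y[1] - B[1][0] * y[0]) / d]
    mu = dot(y, Binv_y) * dot(s, matvec(B, s)) / dot(y, s) ** 2
    return 1.0 / (1.0 - mu)


# Illustrative data (hypothetical): here mu = 1.03125, so phi^c = -32.
B = [[2.0, 0.0], [0.0, 1.0]]
s = [1.0, 1.0]
y = [3.0, 1.0]

phi_c = critical_phi(B, s, y)
det_above = det2(broyden_update(B, s, y, phi_c + 1.0))   # positive: still positive definite
det_at = det2(broyden_update(B, s, y, phi_c))            # zero up to rounding: singular
det_below = det2(broyden_update(B, s, y, phi_c - 1.0))   # negative: indefinite
```

For this data the determinant of the updated matrix works out to $8/3 + \phi_k/12$, so it changes sign exactly at $\phi_k = -32$, in agreement with (6.37).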

When the line search is exact, all methods in the Broyden class with $\phi_k \ge \phi_k^c$ generate
the same sequence of iterates. This result applies to general nonlinear functions and is
based on the observation that when all the line searches are exact, the directions generated
by Broyden-class methods differ only in their lengths. The line searches identify the same
minima along the chosen search direction, though the values of the step lengths may differ
because of the different scaling.

The  Broyden  class  has  several  remarkable  properties  when  applied  with  exact  line 
searches  to  quadratic  functions.  We  state  some  of  these  properties  in  the  next  theorem, 
whose  proof  is  omitted. 

Theorem  6.4. 

Suppose that a method in the Broyden class is applied to the strongly convex quadratic
function $f(x) = b^T x + \frac12 x^T A x$, where $x_0$ is the starting point and $B_0$ is any symmetric positive
definite matrix. Assume that $\alpha_k$ is the exact step length and that $\phi_k > \phi_k^c$ for all $k$, where $\phi_k^c$ is
defined by (6.37). Then the following statements are true.

(i) The iterates are independent of $\phi_k$ and converge to the solution in at most $n$ iterations.

(ii) The secant equation is satisfied for all previous search directions, that is,

\[ B_k s_j = y_j, \quad j = k-1, k-2, \ldots, 1. \]

(iii) If the starting matrix is $B_0 = I$, then the iterates are identical to those generated by
the conjugate gradient method (see Chapter 5). In particular, the search directions are
conjugate, that is,

\[ s_i^T A s_j = 0, \quad \text{for } i \ne j. \]

(iv) If $n$ iterations are performed, we have $B_n = A$.

Note that parts (i), (ii), and (iv) of this result echo the statement and proof of Theorem 6.1,
where similar results were derived for the SR1 update formula.

We can generalize Theorem 6.4 slightly: It continues to hold if the Hessian approximations
remain nonsingular but not necessarily positive definite. (Hence, we could allow
$\phi_k$ to be smaller than $\phi_k^c$, provided that the chosen value did not produce a singular updated
matrix.) We can also generalize point (iii) as follows. If the starting matrix $B_0$ is not the
identity matrix, then the Broyden-class method is identical to the preconditioned conjugate
gradient method that uses $B_0$ as preconditioner.
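
Parts (i) and (iv) of Theorem 6.4 can be illustrated on a two-variable quadratic. The sketch below is a hedged example rather than a reference implementation: the function names and data are assumptions, the linear solve uses Cramer's rule and therefore only works for $n = 2$, and the exact step length is computed from the closed form $\alpha = -\nabla f_k^T p_k / (p_k^T A p_k)$ that holds for quadratics.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def broyden_update(B, s, y, phi):
    """Broyden-class update (6.32) of a symmetric matrix B."""
    Bs = matvec(B, s)
    sBs, ys = dot(s, Bs), dot(y, s)
    v = [yi / ys - bi / sBs for yi, bi in zip(y, Bs)]
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys + phi * sBs * v[i] * v[j]
             for j in range(n)] for i in range(n)]


def solve2(B, r):
    """Solve the 2x2 linear system B p = r by Cramer's rule."""
    d = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [(B[1][1] * r[0] - B[0][1] * r[1]) / d,
            (B[0][0] * r[1] - B[1][0] * r[0]) / d]


def broyden_exact_line_search(A, b, x0, phi, num_steps):
    """Broyden-class method with B_0 = I and exact line searches on b^T x + 0.5 x^T A x."""
    B = [[1.0, 0.0], [0.0, 1.0]]
    x = list(x0)
    iterates = [x]
    for _ in range(num_steps):
        g = [ax + bi for ax, bi in zip(matvec(A, x), b)]
        p = solve2(B, [-gi for gi in g])
        alpha = -dot(g, p) / dot(p, matvec(A, p))   # exact minimizer along p
        s = [alpha * pi for pi in p]
        x = [xi + si for xi, si in zip(x, s)]
        y = matvec(A, s)                            # gradient change on a quadratic
        B = broyden_update(B, s, y, phi)
        iterates.append(x)
    return iterates, B


# Illustrative data (hypothetical); n = 2, so two iterations should suffice.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]
iter_bfgs, B_bfgs = broyden_exact_line_search(A, b, [0.0, 0.0], phi=0.0, num_steps=2)
iter_dfp, B_dfp = broyden_exact_line_search(A, b, [0.0, 0.0], phi=1.0, num_steps=2)
```

In this example the BFGS and DFP runs produce search directions of different lengths at the second iteration, yet the exact line search brings both to the same point, and both final matrices agree with $A$.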

We  conclude  by  commenting  that  results  like  Theorem  6.4  would  appear  to  be  of 
mainly  theoretical  interest,  since  the  inexact  line  searches  used  in  practical  implementations 
of  Broyden-class  methods  (and  all  other  quasi-Newton  methods)  cause  their  performance 
to  differ  markedly.  Nevertheless,  it  is  worth  noting  that  this  type  of  analysis  guided  much  of 
the development of quasi-Newton methods.

6.4  CONVERGENCE ANALYSIS


In this section we present global and local convergence results for practical implementations
of the BFGS and SR1 methods. We give more details for BFGS because its analysis is more
general and illuminating than that of SR1. The fact that the Hessian approximations evolve
by means of updating formulas makes the analysis of quasi-Newton methods much more
complex than that of steepest descent and Newton's method.

Although the BFGS and SR1 methods are known to be remarkably robust in practice,
we will not be able to establish truly global convergence results for general nonlinear objective
functions. That is, we cannot prove that the iterates of these quasi-Newton methods approach
a stationary point of the problem from any starting point and any (suitable) initial Hessian
approximation. In fact, it is not yet known if the algorithms enjoy such properties. In our
analysis we will either assume that the objective function is convex or that the iterates satisfy
certain properties. On the other hand, there are well known local, superlinear convergence
results that are true under reasonable assumptions.

Throughout this section we use $\|\cdot\|$ to denote the Euclidean vector or matrix norm,
and denote the Hessian matrix $\nabla^2 f(x)$ by $G(x)$.


GLOBAL  CONVERGENCE  OF  THE  BFGS  METHOD 

We study the global convergence of the BFGS method, with a practical line search,
when applied to a smooth convex function from an arbitrary starting point $x_0$ and from
any initial Hessian approximation $B_0$ that is symmetric and positive definite. We state our
precise assumptions about the objective function formally, as follows.

Assumption 6.1.

(i) The objective function $f$ is twice continuously differentiable.

(ii) The level set $\mathcal{L} = \{ x \in \mathbb{R}^n \mid f(x) \le f(x_0) \}$ is convex, and there exist positive constants
$m$ and $M$ such that

\[ m \|z\|^2 \le z^T G(x) z \le M \|z\|^2 \tag{6.39} \]

for all $z \in \mathbb{R}^n$ and $x \in \mathcal{L}$.

Part (ii) of this assumption implies that $G(x)$ is positive definite on $\mathcal{L}$ and that $f$ has a
unique minimizer $x^*$ in $\mathcal{L}$.

By using (6.12) and (6.39) we obtain

\[ \frac{y_k^T s_k}{s_k^T s_k} = \frac{s_k^T \bar{G}_k s_k}{s_k^T s_k} \ge m, \tag{6.40} \]

where $\bar{G}_k$ is the average Hessian defined in (6.11). Assumption 6.1 implies that $\bar{G}_k$ is
positive definite, so its square root is well-defined. Therefore, as in (6.21), we have by
defining $z_k = \bar{G}_k^{1/2} s_k$ that

\[ \frac{y_k^T y_k}{y_k^T s_k} = \frac{s_k^T \bar{G}_k^2 s_k}{s_k^T \bar{G}_k s_k}
   = \frac{z_k^T \bar{G}_k z_k}{z_k^T z_k} \le M. \tag{6.41} \]


We  are  now  ready  to  present  the  global  convergence  result  for  the  BFGS  method.  It 
does  not  seem  to  be  possible  to  establish  a  bound  on  the  condition  number  of  the  Hessian 
approximations  Bk,  as  is  done  in  Section  3.2.  Instead,  we  will  introduce  two  new  tools 
in  the  analysis,  the  trace  and  determinant,  to  estimate  the  size  of  the  largest  and  smallest 
eigenvalues of the Hessian approximations. The trace of a matrix (denoted by trace(·)) is
the sum of its eigenvalues, while the determinant (denoted by det(·)) is the product of the
eigenvalues; see the Appendix for a brief discussion of their properties.

Theorem  6.5. 

Let $B_0$ be any symmetric positive definite initial matrix, and let $x_0$ be a starting point
for which Assumption 6.1 is satisfied. Then the sequence $\{x_k\}$ generated by Algorithm 6.1 (with
$\epsilon = 0$) converges to the minimizer $x^*$ of $f$.

PROOF. We start by defining

\[ m_k = \frac{y_k^T s_k}{s_k^T s_k}, \qquad M_k = \frac{y_k^T y_k}{y_k^T s_k}, \tag{6.42} \]

and note from (6.40) and (6.41) that

\[ m_k \ge m, \qquad M_k \le M. \tag{6.43} \]

By computing the trace of the BFGS update (6.19), we obtain that

\[ \mathrm{trace}(B_{k+1}) = \mathrm{trace}(B_k) - \frac{\|B_k s_k\|^2}{s_k^T B_k s_k}
   + \frac{\|y_k\|^2}{y_k^T s_k} \tag{6.44} \]

(see Exercise 6.11). We can also show (Exercise 6.10) that

\[ \det(B_{k+1}) = \det(B_k) \frac{y_k^T s_k}{s_k^T B_k s_k}. \tag{6.45} \]

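We now define

\[ \cos\theta_k = \frac{s_k^T B_k s_k}{\|s_k\| \, \|B_k s_k\|}, \qquad
   q_k = \frac{s_k^T B_k s_k}{s_k^T s_k}, \tag{6.46} \]

so that $\theta_k$ is the angle between $s_k$ and $B_k s_k$. We then obtain that

\[ \frac{\|B_k s_k\|^2}{s_k^T B_k s_k}
   = \frac{\|B_k s_k\|^2 \, \|s_k\|^2}{(s_k^T B_k s_k)^2} \, \frac{s_k^T B_k s_k}{\|s_k\|^2}
   = \frac{q_k}{\cos^2\theta_k}. \tag{6.47} \]

In addition, we have from (6.42) that

\[ \det(B_{k+1}) = \det(B_k) \frac{y_k^T s_k}{s_k^T s_k} \, \frac{s_k^T s_k}{s_k^T B_k s_k}
   = \det(B_k) \frac{m_k}{q_k}. \tag{6.48} \]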

We now combine the trace and determinant by introducing the following function of
a positive definite matrix $B$:

\[ \psi(B) = \mathrm{trace}(B) - \ln(\det(B)), \tag{6.49} \]

where $\ln(\cdot)$ denotes the natural logarithm. It is not difficult to show that $\psi(B) > 0$; see
Exercise 6.9. By using (6.42) and (6.44)-(6.49), we have that

\[
\begin{aligned}
\psi(B_{k+1}) &= \mathrm{trace}(B_k) + M_k - \frac{q_k}{\cos^2\theta_k}
                 - \ln(\det(B_k)) - \ln m_k + \ln q_k \\
              &= \psi(B_k) + (M_k - \ln m_k - 1)
                 + \left[ 1 - \frac{q_k}{\cos^2\theta_k} + \ln \frac{q_k}{\cos^2\theta_k} \right]
                 + \ln \cos^2\theta_k.
\end{aligned}
\tag{6.50}
\]

Now, since the function $h(t) = 1 - t + \ln t$ is nonpositive for all $t > 0$ (see Exercise 6.8),
the term inside the square brackets is nonpositive, and thus from (6.43) and (6.50) we have

\[ 0 < \psi(B_{k+1}) \le \psi(B_0) + c(k+1) + \sum_{j=0}^{k} \ln \cos^2\theta_j, \tag{6.51} \]

where we can assume the constant $c = M - \ln m - 1$ to be positive, without loss of generality.

We now relate these expressions to the results given in Section 3.2. Note from the form
$s_k = -\alpha_k B_k^{-1} \nabla f_k$ of the quasi-Newton iteration that $\cos\theta_k$ defined by (6.46) is the cosine of the angle
between the steepest descent direction and the search direction, which plays a crucial role in
the global convergence theory of Chapter 3. From (3.22), (3.23) we know that the sequence
$\|\nabla f_k\|$ generated by the line search algorithm is bounded away from zero only if $\cos\theta_j \to 0$.

Let us then proceed by contradiction and assume that $\cos\theta_j \to 0$. Then there exists
$k_1 > 0$ such that for all $j > k_1$, we have

\[ \ln \cos^2\theta_j < -2c, \]

where $c$ is the constant defined above. Using this inequality in (6.51) we find the following
relations to be true for all $k > k_1$:

\[
\begin{aligned}
0 &< \psi(B_0) + c(k+1) + \sum_{j=0}^{k_1} \ln \cos^2\theta_j + \sum_{j=k_1+1}^{k} (-2c) \\
  &= \psi(B_0) + \sum_{j=0}^{k_1} \ln \cos^2\theta_j + 2ck_1 + c - ck.
\end{aligned}
\]

However, the right-hand-side is negative for large $k$, giving a contradiction. Therefore, there
exists a subsequence of indices $\{j_k\}_{k=1,2,\ldots}$ such that $\cos\theta_{j_k} \ge \delta > 0$. By Zoutendijk's result
(3.14) this bound implies that $\liminf \|\nabla f_k\| = 0$. Since the problem is strongly convex, the
latter limit is enough to prove that $x_k \to x^*$. □
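
The trace and determinant identities (6.44) and (6.45) used in this proof, together with the positivity of $\psi$ in (6.49), are easy to verify numerically on a small case. The sketch below is a hypothetical check with made-up data and helper names; it is restricted to $2 \times 2$ matrices so that the determinant can be written out explicitly.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bfgs_update(B, s, y):
    """BFGS update of B: B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s)."""
    Bs = matvec(B, s)
    sBs, ys = dot(s, Bs), dot(y, s)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def psi(M):
    """The function (6.49): trace(M) - ln(det(M)), for a positive definite 2x2 matrix."""
    return trace(M) - math.log(det2(M))


# Illustrative data (hypothetical): B positive definite and y^T s > 0.
B = [[2.0, 0.0], [0.0, 1.0]]
s = [1.0, 1.0]
y = [3.0, 1.0]

B_next = bfgs_update(B, s, y)
Bs = matvec(B, s)
trace_rhs = trace(B) - dot(Bs, Bs) / dot(s, Bs) + dot(y, y) / dot(y, s)   # right side of (6.44)
det_rhs = det2(B) * dot(y, s) / dot(s, Bs)                                 # right side of (6.45)
psi_next = psi(B_next)
```

For this data both sides of (6.44) equal $23/6$ and both sides of (6.45) equal $8/3$, and `psi_next` is positive, as Exercise 6.9 asserts for any positive definite matrix.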

Theorem 6.5 has been generalized to the entire restricted Broyden class, except for
the DFP method. In other words, Theorem 6.5 can be shown to hold for all $\phi_k \in [0, 1)$
in (6.32), but the argument seems to break down as $\phi_k$ approaches 1 because some of the
self-correcting properties of the update are weakened considerably.

An extension of the analysis just given shows that the rate of convergence of the iterates
is linear. In particular, we can show that the sequence $\|x_k - x^*\|$ converges to zero rapidly
enough that

\[ \sum_{k=1}^{\infty} \|x_k - x^*\| < \infty. \tag{6.52} \]

We will not prove this claim, but rather establish that if (6.52) holds, then the rate of
convergence is actually superlinear.


SUPERLINEAR  CONVERGENCE  OF  THE  BFGS  METHOD 

The analysis of this section makes use of the Dennis and Moré characterization (3.36)
of superlinear convergence. It applies to general nonlinear (not just convex) objective
functions. For the results that follow we need to make an additional assumption.

Assumption 6.2.

The Hessian matrix $G$ is Lipschitz continuous at $x^*$, that is,

\[ \|G(x) - G(x^*)\| \le L \|x - x^*\|, \]

for all $x$ near $x^*$, where $L$ is a positive constant.

We start by introducing the quantities

\[ \tilde{s}_k = G_*^{1/2} s_k, \qquad \tilde{y}_k = G_*^{-1/2} y_k, \qquad
   \tilde{B}_k = G_*^{-1/2} B_k G_*^{-1/2}, \]


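where $G_* = G(x^*)$ and $x^*$ is a minimizer of $f$. Similarly to (6.46), we define

\[ \cos\tilde{\theta}_k = \frac{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k}{\|\tilde{s}_k\| \, \|\tilde{B}_k \tilde{s}_k\|}, \qquad
   \tilde{q}_k = \frac{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k}{\|\tilde{s}_k\|^2}, \]

while we echo (6.42) and (6.43) in defining

\[ \tilde{M}_k = \frac{\|\tilde{y}_k\|^2}{\tilde{y}_k^T \tilde{s}_k}, \qquad
   \tilde{m}_k = \frac{\tilde{y}_k^T \tilde{s}_k}{\tilde{s}_k^T \tilde{s}_k}. \]

By pre- and postmultiplying the BFGS update formula (6.19) by $G_*^{-1/2}$ and grouping
terms appropriately, we obtain

\[ \tilde{B}_{k+1} = \tilde{B}_k - \frac{\tilde{B}_k \tilde{s}_k \tilde{s}_k^T \tilde{B}_k}{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k}
   + \frac{\tilde{y}_k \tilde{y}_k^T}{\tilde{y}_k^T \tilde{s}_k}. \]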

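Since this expression has precisely the same form as the BFGS formula (6.19), it follows
from the argument leading to (6.50) that

\[ \psi(\tilde{B}_{k+1}) = \psi(\tilde{B}_k) + (\tilde{M}_k - \ln \tilde{m}_k - 1)
   + \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde{\theta}_k} + \ln \frac{\tilde{q}_k}{\cos^2\tilde{\theta}_k} \right]
   + \ln \cos^2\tilde{\theta}_k. \tag{6.53} \]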


Recalling (6.12), we have that

\[ y_k - G_* s_k = (\bar{G}_k - G_*) s_k, \]

and thus

\[ \tilde{y}_k - \tilde{s}_k = G_*^{-1/2} (\bar{G}_k - G_*) G_*^{-1/2} \tilde{s}_k. \]

By Assumption 6.2, and recalling the definition (6.11), we have

\[ \|\tilde{y}_k - \tilde{s}_k\| \le \|G_*^{-1/2}\|^2 \, \|\tilde{s}_k\| \, \|\bar{G}_k - G_*\|
   \le \|G_*^{-1/2}\|^2 \, \|\tilde{s}_k\| \, L \epsilon_k, \]

where $\epsilon_k$ is defined by

\[ \epsilon_k = \max\{ \|x_{k+1} - x^*\|, \|x_k - x^*\| \}. \]

We have thus shown that

\[ \frac{\|\tilde{y}_k - \tilde{s}_k\|}{\|\tilde{s}_k\|} \le \bar{c} \epsilon_k, \tag{6.54} \]

for some positive constant $\bar{c}$. This inequality and (6.52) play an important role in superlinear
convergence, as we now show.

Theorem  6.6. 

Suppose that $f$ is twice continuously differentiable and that the iterates generated by the
BFGS algorithm converge to a minimizer $x^*$ at which Assumption 6.2 holds. Suppose also that
(6.52) holds. Then $x_k$ converges to $x^*$ at a superlinear rate.


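PROOF. From (6.54), we have from the triangle inequality (A.4a) that

\[ \|\tilde{y}_k\| - \|\tilde{s}_k\| \le \bar{c} \epsilon_k \|\tilde{s}_k\|, \qquad
   \|\tilde{s}_k\| - \|\tilde{y}_k\| \le \bar{c} \epsilon_k \|\tilde{s}_k\|, \]

so that

\[ (1 - \bar{c} \epsilon_k) \|\tilde{s}_k\| \le \|\tilde{y}_k\| \le (1 + \bar{c} \epsilon_k) \|\tilde{s}_k\|. \tag{6.55} \]

By squaring (6.54) and using (6.55), we obtain

\[ (1 - \bar{c} \epsilon_k)^2 \|\tilde{s}_k\|^2 - 2 \tilde{y}_k^T \tilde{s}_k + \|\tilde{s}_k\|^2
   \le \|\tilde{y}_k\|^2 - 2 \tilde{y}_k^T \tilde{s}_k + \|\tilde{s}_k\|^2
   \le \bar{c}^2 \epsilon_k^2 \|\tilde{s}_k\|^2, \]

and therefore

\[ 2 \tilde{y}_k^T \tilde{s}_k \ge (1 - 2\bar{c}\epsilon_k + \bar{c}^2 \epsilon_k^2 + 1 - \bar{c}^2 \epsilon_k^2) \|\tilde{s}_k\|^2
   = 2 (1 - \bar{c} \epsilon_k) \|\tilde{s}_k\|^2. \]

It follows from the definition of $\tilde{m}_k$ that

\[ \tilde{m}_k = \frac{\tilde{y}_k^T \tilde{s}_k}{\|\tilde{s}_k\|^2} \ge 1 - \bar{c} \epsilon_k. \tag{6.56} \]

By combining (6.55) and (6.56), we obtain also that

\[ \tilde{M}_k = \frac{\|\tilde{y}_k\|^2}{\tilde{y}_k^T \tilde{s}_k} \le \frac{1 + \bar{c} \epsilon_k}{1 - \bar{c} \epsilon_k}. \tag{6.57} \]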


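Since $x_k \to x^*$, we have that $\epsilon_k \to 0$, and thus by (6.57) there exists a positive constant
$c > \bar{c}$ such that the following inequalities hold for all sufficiently large $k$:

\[ \tilde{M}_k \le 1 + \frac{2\bar{c}}{1 - \bar{c}\epsilon_k} \epsilon_k \le 1 + c \epsilon_k. \tag{6.58} \]

We again make use of the nonpositiveness of the function $h(t) = 1 - t + \ln t$. Therefore,
we have

\[ \frac{-x}{1 - x} - \ln(1 - x) = h\left( \frac{1}{1 - x} \right) \le 0. \]

Now, for $k$ large enough we can assume that $\bar{c} \epsilon_k < \frac12$, and therefore

\[ \ln(1 - \bar{c}\epsilon_k) \ge \frac{-\bar{c}\epsilon_k}{1 - \bar{c}\epsilon_k} \ge -2\bar{c}\epsilon_k. \]

This relation and (6.56) imply that for sufficiently large $k$, we have

\[ \ln \tilde{m}_k \ge \ln(1 - \bar{c}\epsilon_k) \ge -2\bar{c}\epsilon_k > -2c\epsilon_k. \tag{6.59} \]

We can now deduce from (6.53), (6.58), and (6.59) that

\[ 0 < \psi(\tilde{B}_{k+1}) \le \psi(\tilde{B}_k) + 3c\epsilon_k + \ln \cos^2\tilde{\theta}_k
   + \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde{\theta}_k} + \ln \frac{\tilde{q}_k}{\cos^2\tilde{\theta}_k} \right]. \tag{6.60} \]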


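By summing this expression and making use of (6.52) we have that

\[ \sum_{j=0}^{\infty} \left( \ln \frac{1}{\cos^2\tilde{\theta}_j}
   - \left[ 1 - \frac{\tilde{q}_j}{\cos^2\tilde{\theta}_j} + \ln \frac{\tilde{q}_j}{\cos^2\tilde{\theta}_j} \right] \right)
   \le \psi(\tilde{B}_0) + 3c \sum_{j=0}^{\infty} \epsilon_j < +\infty. \]

Since the term in the square brackets is nonpositive, and since $\ln(1/\cos^2\tilde{\theta}_j) \ge 0$ for all $j$,
we obtain the two limits

\[ \lim_{j \to \infty} \ln \frac{1}{\cos^2\tilde{\theta}_j} = 0, \qquad
   \lim_{j \to \infty} \left( 1 - \frac{\tilde{q}_j}{\cos^2\tilde{\theta}_j} + \ln \frac{\tilde{q}_j}{\cos^2\tilde{\theta}_j} \right) = 0, \]

which imply that

\[ \lim_{j \to \infty} \cos\tilde{\theta}_j = 1, \qquad \lim_{j \to \infty} \tilde{q}_j = 1. \tag{6.61} \]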

j^oo  j  >-oo 

The essence of the result has now been proven; we need only to interpret these limits
in terms of the Dennis-Moré characterization of superlinear convergence.
Recalling (6.47), we have

\[
\frac{\|G_*^{-1/2} (B_k - G_*) s_k\|^2}{\|G_*^{1/2} s_k\|^2}
  = \frac{\|(\tilde{B}_k - I) \tilde{s}_k\|^2}{\|\tilde{s}_k\|^2}
  = \frac{\|\tilde{B}_k \tilde{s}_k\|^2 - 2 \tilde{s}_k^T \tilde{B}_k \tilde{s}_k + \tilde{s}_k^T \tilde{s}_k}{\tilde{s}_k^T \tilde{s}_k}
  = \frac{\tilde{q}_k^2}{\cos^2\tilde{\theta}_k} - 2\tilde{q}_k + 1.
\]

Since by (6.61) the right-hand-side converges to 0, we conclude that

\[ \lim_{k \to \infty} \frac{\|(B_k - G_*) s_k\|}{\|s_k\|} = 0. \]

The limit (3.36) and Theorem 3.6 imply that the unit step length $\alpha_k = 1$ will satisfy the Wolfe
conditions near the solution, and hence that the rate of convergence is superlinear. □


CONVERGENCE  ANALYSIS  OF  THE  SR1  METHOD

The convergence properties of the SR1 method are not as well understood as those of
the BFGS method. No global results like Theorem 6.5 or local superlinear results like
Theorem 6.6 have been established, except the results for quadratic functions discussed earlier.
There is, however, an interesting result for the trust-region SR1 algorithm, Algorithm 6.2.
It states that when the objective function has a unique stationary point and the condition
(6.26) holds at every step (so that the SR1 update is never skipped) and the Hessian
approximations $B_k$ are bounded above, then the iterates converge to $x^*$ at an $(n+1)$-step
superlinear rate. The result does not require exact solution of the trust-region subproblem
(6.27).

We state the result formally as follows.

Theorem  6.7. 

Suppose that the iterates $x_k$ are generated by Algorithm 6.2. Suppose also that the following
conditions hold:

(c1) The sequence of iterates does not terminate, but remains in a closed, bounded, convex set
$D$, on which the function $f$ is twice continuously differentiable, and in which $f$ has a
unique stationary point $x^*$;

(c2) the Hessian $\nabla^2 f(x^*)$ is positive definite, and $\nabla^2 f(x)$ is Lipschitz continuous in a
neighborhood of $x^*$;

(c3) the sequence of matrices $\{B_k\}$ is bounded in norm;

(c4) condition (6.26) holds at every iteration, where $r$ is some constant in $(0, 1)$.

Then $\lim_{k \to \infty} x_k = x^*$, and we have that

\[ \lim_{k \to \infty} \frac{\|x_{k+n+1} - x^*\|}{\|x_k - x^*\|} = 0. \]


Note that the BFGS method does not require the boundedness assumption (c3) to
hold. As we have mentioned already, the SR1 update does not necessarily maintain positive
definiteness of the Hessian approximations $B_k$. In practice, $B_k$ may be indefinite at any
iteration, which means that the trust region bound may continue to be active for arbitrarily
large $k$. Interestingly, however, it can be shown that the SR1 Hessian approximations tend
to be positive definite most of the time. The precise result is that

\[ \lim_{k \to \infty} \frac{\text{number of indices } j = 1, 2, \ldots, k \text{ for which } B_j \text{ is positive semidefinite}}{k} = 1, \]

under the assumptions of Theorem 6.7. This result holds regardless of whether the initial
Hessian approximation is positive definite or not.


NOTES  AND  REFERENCES 

For a comprehensive treatment of quasi-Newton methods see Dennis and Schnabel [92],
Dennis and Moré [91], and Fletcher [101]. A formula for updating the Cholesky
factors of the BFGS matrices is given in Dennis and Schnabel [92].

Several safeguards and modifications of the SR1 method have been proposed, but
the condition (6.26) is favored in the light of the analysis of Conn, Gould, and Toint [71].
Computational experiments by Conn, Gould, and Toint [70, 73] and Khalfan, Byrd, and
Schnabel [181], using both line search and trust-region approaches, indicate that the SR1
method appears to be competitive with the BFGS method. The proof of Theorem 6.7 is
given in Byrd, Khalfan, and Schnabel [51].

A study of the convergence of BFGS matrices for nonlinear problems can be found in
Ge and Powell [119] and Boggs and Tolle [32]; however, the results are not as satisfactory as
for SR1 updating.

The global convergence of the BFGS method was established by Powell [246]. This
result was extended to the restricted Broyden class, except for DFP, by Byrd, Nocedal, and
Yuan [53]. For a discussion of the self-correcting properties of quasi-Newton methods
see Nocedal [229]. Most of the early analysis of quasi-Newton methods was based on the
bounded deterioration principle. This is a tool for the local analysis that quantifies the worst-case
behavior of quasi-Newton updating. Assuming that the starting point is sufficiently
close to the solution $x^*$ and that the initial Hessian approximation is sufficiently close to
$\nabla^2 f(x^*)$, one can use the bounded deterioration bounds to prove that the iteration cannot
stray away from the solution. This property can then be used to show that the quality of the
quasi-Newton approximations is good enough to yield superlinear convergence. For details,
see Dennis and Moré [91] or Dennis and Schnabel [92].

Exercises


6.1

(a) Show that if $f$ is strongly convex, then (6.7) holds for any vectors $x_k$ and $x_{k+1}$.

(b) Give an example of a function of one variable satisfying $g(0) = -1$ and $g(1) = -\frac14$
and show that (6.7) does not hold in this case.

6.2  Show that the second strong Wolfe condition (3.7b) implies the curvature
condition (6.7).

6.3  Verify that (6.19) and (6.17) are inverses of each other.

6.4  Use the Sherman-Morrison formula (A.27) to show that (6.24) is the inverse of
(6.25).

6.5  Prove the statements (ii) and (iii) given in the paragraph following (6.25).

6.6  The square root of a matrix $A$ is a matrix $A^{1/2}$ such that $A^{1/2} A^{1/2} = A$. Show
that any symmetric positive definite matrix $A$ has a square root, and that this square root
is itself symmetric and positive definite. (Hint: Use the factorization $A = U D U^T$ (A.16),
where $U$ is orthogonal and $D$ is diagonal with positive diagonal elements.)

6.7  Use the Cauchy-Schwarz inequality (A.5) to verify that $\mu_k \ge 1$, where $\mu_k$ is
defined by (6.38).

6.8  Define $h(t) = 1 - t + \ln t$, and note that $h'(t) = -1 + 1/t$, $h''(t) = -1/t^2 < 0$,
$h(1) = 0$, and $h'(1) = 0$. Show that $h(t) \le 0$ for all $t > 0$.

6.9  Denote the eigenvalues of the positive definite matrix $B$ by $\lambda_1, \lambda_2, \ldots, \lambda_n$, where
$0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Show that the $\psi$ function defined in (6.49) can be written as

\[ \psi(B) = \sum_{i=1}^{n} (\lambda_i - \ln \lambda_i). \]

Use this form to show that $\psi(B) > 0$.

6.10  The object of this exercise is to prove (6.45).

(a) Show that $\det(I + x y^T) = 1 + y^T x$, where $x$ and $y$ are $n$-vectors. Hint: Assuming that
$x \ne 0$, we can find vectors $w_1, w_2, \ldots, w_{n-1}$ such that the matrix $Q$ defined by

\[ Q = [x, w_1, w_2, \ldots, w_{n-1}] \]

is nonsingular and $x = Q e_1$, where $e_1 = (1, 0, 0, \ldots, 0)^T$. If we define

\[ y^T Q = (z_1, z_2, \ldots, z_n), \]

then

\[ z_1 = y^T Q e_1 = y^T Q (Q^{-1} x) = y^T x, \]

and

\[ \det(I + x y^T) = \det(Q^{-1} (I + x y^T) Q) = \det(I + e_1 y^T Q). \]

(b) Use a similar technique to prove that

\[ \det(I + x y^T + u v^T) = (1 + y^T x)(1 + v^T u) - (x^T v)(y^T u). \]

(c) Use this relation to establish (6.45).

6.11  Use the properties of the trace of a symmetric matrix and the formula (6.19) to
prove (6.44).

6.12  Show that if $f$ satisfies Assumption 6.1 and if the sequence of gradients satisfies
$\liminf \|\nabla f_k\| = 0$, then the whole sequence of iterates $x_k$ converges to the solution $x^*$.


Chapter 7

Large-Scale Unconstrained Optimization

Many  applications  give  rise  to  unconstrained  optimization  problems  with  thousands  or 
millions  of  variables.  Problems  of  this  size  can  be  solved  efficiently  only  if  the  storage 
and  computational  costs  of  the  optimization  algorithm  can  be  kept  at  a  tolerable  level.  A 
diverse  collection  of  large-scale  optimization  methods  has  been  developed  to  achieve  this 
goal,  each  being  particularly  effective  for  certain  problem  types.  Some  of  these  methods 
are  straightforward  adaptations  of  the  methods  described  in  Chapters  3,  4,  and  6.  Other 
approaches  are  modifications  of  these  basic  methods  that  allow  approximate  steps  to  be 
calculated at lower cost in computation and storage. One set of approaches that we have
already discussed, the nonlinear conjugate gradient methods of Section 5.2, can be applied
to large problems without modification, because of their minimal storage demands and their
reliance on only first-order derivative information.

The line search and trust-region Newton algorithms of Chapters 3 and 4 require matrix
factorizations of the Hessian matrices $\nabla^2 f_k$. In the large-scale case, these factorizations can
be  carried  out  using  sparse  elimination  techniques.  Such  algorithms  have  received  much 
attention,  and  high  quality  software  implementations  are  available.  If  the  computational 
cost  and  memory  requirements  of  these  sparse  factorization  methods  are  affordable  for  a 
given  application,  and  if  the  Hessian  matrix  can  be  formed  explicitly,  Newton  methods 
based  on  sparse  factorizations  constitute  an  effective  approach  for  solving  such  problems. 

Often,  however,  the  cost  of  factoring  the  Hessian  is  prohibitive,  and  it  is  preferable 
to  compute  approximations  to  the  Newton  step  using  iterative  linear  algebra  techniques. 
Section  7.1  discusses  inexact  Newton  methods  that  use  these  techniques,  in  both  line  search 
and  trust-region  frameworks.  The  resulting  algorithms  have  attractive  global  convergence 
properties  and  may  be  superlinearly  convergent  for  suitable  choices  of  parameters.  They  find 
effective search directions when the Hessian $\nabla^2 f_k$ is indefinite, and may even be implemented
in  a  “Hessian-free”  manner,  without  explicit  calculation  or  storage  of  the  Hessian. 

The  Hessian  approximations  generated  by  the  quasi-Newton  approaches  of  Chapter  6 
are  usually  dense,  even  when  the  true  Hessian  is  sparse,  and  the  cost  of  storing  and  working 
with  these  approximations  can  be  excessive  for  large  n.  Section  7.2  discusses  limited-memory 
variants  of  the  quasi-Newton  approach,  which  use  Hessian  approximations  that  can  be 
stored  compactly  by  using  just  a  few  vectors  of  length  n.  These  methods  are  fairly  robust, 
inexpensive,  and  easy  to  implement,  but  they  do  not  converge  rapidly.  Another  approach, 
discussed briefly in Section 7.3, is to define quasi-Newton approximate Hessians $B_k$ that
preserve  sparsity,  for  example  by  mimicking  the  sparsity  pattern  of  the  Hessian. 

In  Section  7.4,  we  note  that  objective  functions  in  large  problems  often  possess  a 
structural  property  known  as  partial  separability,  which  means  they  can  be  decomposed 
into a sum of simpler functions, each of which depends on only a small subspace of $\mathbb{R}^n$.
Effective  Newton  and  quasi-Newton  methods  that  exploit  this  property  have  been  developed. 
Such  methods  usually  converge  rapidly  and  are  robust,  but  they  require  detailed  information 
about  the  objective  function,  which  can  be  difficult  to  obtain  in  some  applications. 

We  conclude  the  chapter  with  a  discussion  of  software  for  large-scale  unconstrained 
optimization  problems. 


7.1  INEXACT  NEWTON  METHODS 


Recall from (2.15) that the basic Newton step $p_k^N$ is obtained by solving the symmetric $n \times n$
linear system

$$\nabla^2 f_k \, p_k^N = -\nabla f_k. \tag{7.1}$$

In this section, we describe techniques for obtaining approximations to $p_k^N$ that are
inexpensive to calculate but are good search directions or steps. These approaches are
based  on  solving  (7.1)  by  using  the  conjugate  gradient  (CG)  method  (see  Chapter  5)  or 
the Lanczos method, with modifications to handle negative curvature in the Hessian $\nabla^2 f_k$.
Both  line  search  and  trust-region  approaches  are  described  here.  We  refer  to  this  family  of 
methods  by  the  general  name  inexact  Newton  methods. 

The  use  of  iterative  methods  for  (7.1)  spares  us  from  concerns  about  the  expense 
of a direct factorization of the Hessian $\nabla^2 f_k$ and the fill-in that may occur during this
process. Further, we can customize the solution strategy to ensure that the rapid convergence
properties associated with Newton's method are not lost in the inexact version. In addition,
as noted below, we can implement these methods in a Hessian-free manner, so that the
Hessian $\nabla^2 f_k$ need not be calculated or stored explicitly at all.

We  examine  first  how  the  inexactness  in  the  step  calculation  determines  the  local 
convergence  properties  of  inexact  Newton  methods.  We  then  consider  line  search  and 
trust-region  approaches  based  on  using  CG  (possibly  with  preconditioning)  to  obtain  an 
approximate solution of (7.1). Finally, we discuss the use of the Lanczos method for solving
(7.1) approximately.

LOCAL  CONVERGENCE  OF  INEXACT  NEWTON  METHODS 

Most rules for terminating the iterative solver for (7.1) are based on the residual

$$r_k = \nabla^2 f_k \, p_k + \nabla f_k, \tag{7.2}$$

where $p_k$ is the inexact Newton step. Usually, we terminate the CG iterations when

$$\|r_k\| \le \eta_k \|\nabla f_k\|, \tag{7.3}$$

where the sequence $\{\eta_k\}$ (with $0 < \eta_k < 1$ for all $k$) is called the forcing sequence.

We now study how the rate of convergence of inexact Newton methods based on
(7.1)-(7.3) is affected by the choice of the forcing sequence. The next two theorems apply
not just to Newton-CG procedures but to all inexact Newton methods whose steps satisfy
(7.2) and (7.3).

Our first result says that local convergence is obtained simply by ensuring that $\eta_k$ is
bounded away from 1.

Theorem 7.1.

Suppose that $\nabla^2 f(x)$ exists and is continuous in a neighborhood of a minimizer $x^*$, with
$\nabla^2 f(x^*)$ positive definite. Consider the iteration $x_{k+1} = x_k + p_k$ where $p_k$ satisfies (7.3), and
assume that $\eta_k \le \eta$ for some constant $\eta \in [0, 1)$. Then, if the starting point $x_0$ is sufficiently
near $x^*$, the sequence $\{x_k\}$ converges to $x^*$ and satisfies

$$\|\nabla^2 f(x^*)(x_{k+1} - x^*)\| \le \hat{\eta} \, \|\nabla^2 f(x^*)(x_k - x^*)\|, \tag{7.4}$$

for some constant $\hat{\eta}$ with $\eta < \hat{\eta} < 1$.

Rather than giving a rigorous proof of this theorem, we present an informal derivation
that contains the essence of the argument and motivates the next result.

Since the Hessian matrix $\nabla^2 f$ is positive definite at $x^*$ and continuous near $x^*$, there
exists a positive constant $L$ such that $\|(\nabla^2 f_k)^{-1}\| \le L$ for all $x_k$ sufficiently close to $x^*$. We
therefore have from (7.2) that the inexact Newton step satisfies

$$\|p_k\| \le L(\|\nabla f_k\| + \|r_k\|) \le 2L\|\nabla f_k\|,$$

where the second inequality follows from (7.3) and $\eta_k < 1$. Using this expression together
with Taylor's theorem and the continuity of $\nabla^2 f(x)$, we obtain

$$\begin{aligned}
\nabla f_{k+1} &= \nabla f_k + \nabla^2 f_k \, p_k + \int_0^1 [\nabla^2 f(x_k + t p_k) - \nabla^2 f(x_k)] \, p_k \, dt \\
&= \nabla f_k + \nabla^2 f_k \, p_k + o(\|p_k\|) \\
&= \nabla f_k - (\nabla f_k - r_k) + o(\|\nabla f_k\|) \\
&= r_k + o(\|\nabla f_k\|).
\end{aligned} \tag{7.5}$$

Taking norms and recalling (7.3), we have that

$$\|\nabla f_{k+1}\| \le \eta_k \|\nabla f_k\| + o(\|\nabla f_k\|) \le (\eta_k + o(1)) \|\nabla f_k\|. \tag{7.6}$$

When $x_k$ is close enough to $x^*$ that the $o(1)$ term in the last estimate is bounded by $(1 - \eta)/2$,
we have

$$\|\nabla f_{k+1}\| \le (\eta_k + (1 - \eta)/2) \|\nabla f_k\| \le \frac{1 + \eta}{2} \|\nabla f_k\|, \tag{7.7}$$

so the gradient norm decreases by a factor of $(1 + \eta)/2$ at this iteration. By choosing the
initial point $x_0$ sufficiently close to $x^*$, we can ensure that this rate of decrease occurs at
every iteration.

To prove (7.4), we note that under our smoothness assumptions, we have

$$\nabla f_k = \nabla^2 f(x^*)(x_k - x^*) + o(\|x_k - x^*\|).$$

Hence it can be shown that for $x_k$ close to $x^*$, the gradient $\nabla f_k$ differs from the scaled error
$\nabla^2 f(x^*)(x_k - x^*)$ by only a relatively small perturbation. A similar estimate holds at $x_{k+1}$,
so (7.4) follows from (7.7).

From (7.6), we have that

$$\frac{\|\nabla f_{k+1}\|}{\|\nabla f_k\|} \le \eta_k + o(1). \tag{7.8}$$

If $\lim_{k \to \infty} \eta_k = 0$, we have from this expression that

$$\lim_{k \to \infty} \frac{\|\nabla f_{k+1}\|}{\|\nabla f_k\|} = 0,$$

indicating Q-superlinear convergence of the gradient norms $\|\nabla f_k\|$ to zero. Superlinear
convergence of the iterates $\{x_k\}$ to $x^*$ can be proved as a consequence.

We can obtain quadratic convergence by making the additional assumption that the
Hessian $\nabla^2 f(x)$ is Lipschitz continuous near $x^*$. In this case, the estimate (7.5) can be
tightened to

$$\nabla f_{k+1} = r_k + O(\|\nabla f_k\|^2).$$

By choosing the forcing sequence so that $\eta_k = O(\|\nabla f_k\|)$, we have from this expression
that

$$\|\nabla f_{k+1}\| = O(\|\nabla f_k\|^2),$$

indicating Q-quadratic convergence of the gradient norms to zero (and thus also Q-quadratic
convergence of the iterates $x_k$ to $x^*$). The last two observations are summarized in the
following theorem.

Theorem 7.2.

Suppose that the conditions of Theorem 7.1 hold, and assume that the iterates $\{x_k\}$
generated by the inexact Newton method converge to $x^*$. Then the rate of convergence is
superlinear if $\eta_k \to 0$. If in addition, $\nabla^2 f(x)$ is Lipschitz continuous for $x$ near $x^*$ and if
$\eta_k = O(\|\nabla f_k\|)$, then the convergence is quadratic.

To obtain superlinear convergence, we can set, for example, $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$;
the choice $\eta_k = \min(0.5, \|\nabla f_k\|)$ would yield quadratic convergence.
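The two forcing-sequence choices just mentioned can be written down directly. The following is a minimal Python sketch; the function names `forcing_term` and `residual_tolerance` and the `rate` argument are illustrative assumptions, not part of the text.

```python
import math

def forcing_term(grad_norm: float, rate: str = "superlinear") -> float:
    """Return eta_k for the termination test ||r_k|| <= eta_k * ||grad f_k|| of (7.3)."""
    if rate == "superlinear":
        return min(0.5, math.sqrt(grad_norm))   # eta_k -> 0 as the gradient vanishes
    if rate == "quadratic":
        return min(0.5, grad_norm)              # eta_k = O(||grad f_k||)
    raise ValueError("rate must be 'superlinear' or 'quadratic'")

def residual_tolerance(grad_norm: float, rate: str = "superlinear") -> float:
    """Absolute bound eta_k * ||grad f_k|| on the residual norm ||r_k||."""
    return forcing_term(grad_norm, rate) * grad_norm

# Far from the solution both choices are capped at 0.5; near the solution the
# quadratic choice demands a much more accurate solve of (7.1).
for g in (10.0, 1e-2, 1e-4):
    print(g, forcing_term(g), forcing_term(g, "quadratic"))
```

The cap at 0.5 keeps $\eta_k$ bounded away from 1, as Theorem 7.1 requires, while the second argument of `min` drives $\eta_k$ to zero at the rate needed by Theorem 7.2.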

All  the  results  presented  in  this  section,  which  are  proved  by  Dembo,  Eisenstat,  and 
Steihaug [89], are local in nature: They assume that the sequence $\{x_k\}$ eventually enters
the near vicinity of the solution $x^*$. They also assume that the unit step length $\alpha_k = 1$ is
taken  and  hence  that  globalization  strategies  do  not  interfere  with  rapid  convergence.  In 
the  following  pages  we  show  that  inexact  Newton  strategies  can,  in  fact,  be  incorporated 
in  practical  line  search  and  trust-region  implementations  of  Newton’s  method,  yielding 
algorithms  with  good  local  and  global  convergence  properties.  We  start  with  a  line  search 
approach. 

LINE  SEARCH  NEWTON-CG  METHOD 

In the line search Newton-CG method, also known as the truncated Newton method, we
compute the search direction by applying the CG method to the Newton equations (7.1) and
attempt to satisfy a termination test of the form (7.3). However, the CG method is designed
to solve positive definite systems, and the Hessian $\nabla^2 f_k$ may have negative eigenvalues when
$x_k$ is not close to a solution. Therefore, we terminate the CG iteration as soon as a direction
of negative curvature is generated. This adaptation of the CG method produces a search
direction $p_k$ that is a descent direction. Moreover, the adaptation guarantees that the fast
convergence rate of the pure Newton method is preserved, provided that the step length
$\alpha_k = 1$ is used whenever it satisfies the acceptance criteria.

We now describe Algorithm 7.1, a line search algorithm that uses a modification of
Algorithm 5.2 as the inner iteration to compute each search direction $p_k$. For purposes of
this algorithm, we write the linear system (7.1) in the form

$$B_k p = -\nabla f_k, \tag{7.9}$$

where $B_k$ represents $\nabla^2 f_k$. For the inner CG iteration, we denote the search directions by
$d_j$ and the sequence of iterates that it generates by $z_j$. When $B_k$ is positive definite, the
inner iteration sequence $\{z_j\}$ will converge to the Newton step $p_k^N$ that solves (7.9). At each
major iteration, we define a tolerance $\epsilon_k$ that specifies the required accuracy of the computed
solution. For concreteness, we choose the forcing sequence to be $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$
to obtain a superlinear convergence rate, but other choices are possible.

Algorithm 7.1 (Line Search Newton-CG).

Given initial point $x_0$;
for $k = 0, 1, 2, \ldots$
    Define tolerance $\epsilon_k = \min(0.5, \sqrt{\|\nabla f_k\|}) \, \|\nabla f_k\|$;
    Set $z_0 = 0$, $r_0 = \nabla f_k$, $d_0 = -r_0 = -\nabla f_k$;
    for $j = 0, 1, 2, \ldots$
        if $d_j^T B_k d_j \le 0$
            if $j = 0$
                return $p_k = -\nabla f_k$;
            else
                return $p_k = z_j$;
        Set $\alpha_j = r_j^T r_j / d_j^T B_k d_j$;
        Set $z_{j+1} = z_j + \alpha_j d_j$;
        Set $r_{j+1} = r_j + \alpha_j B_k d_j$;
        if $\|r_{j+1}\| < \epsilon_k$
            return $p_k = z_{j+1}$;
        Set $\beta_{j+1} = r_{j+1}^T r_{j+1} / r_j^T r_j$;
        Set $d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$;
    end (for)
    Set $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe, Goldstein, or
        Armijo backtracking conditions (using $\alpha_k = 1$ if possible);
end
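As an illustration, the inner CG loop of Algorithm 7.1 might be sketched in Python as follows. This is a sketch under stated assumptions rather than a reference implementation: vectors are plain lists, the Hessian is supplied only through a callable `hess_vec` that returns $B_k d$, the iteration cap `max_iter` is a safeguard that the algorithm as stated does not include, and the outer line search is omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def newton_cg_direction(grad, hess_vec, max_iter=None):
    """Inner loop of Algorithm 7.1: approximately solve B_k p = -grad by CG.

    grad     : gradient of f at x_k (list of floats)
    hess_vec : callable d -> B_k d (assumed interface)
    """
    n = len(grad)
    gnorm = math.sqrt(dot(grad, grad))
    eps = min(0.5, math.sqrt(gnorm)) * gnorm      # tolerance eps_k
    z = [0.0] * n                                 # z_0 = 0
    r = list(grad)                                # r_0 = grad f_k
    d = [-ri for ri in r]                         # d_0 = -r_0
    if max_iter is None:
        max_iter = 10 * n                         # hypothetical safeguard
    for j in range(max_iter):
        Bd = hess_vec(d)
        curv = dot(d, Bd)
        if curv <= 0.0:                           # nonpositive curvature detected
            return [-gi for gi in grad] if j == 0 else z
        alpha = dot(r, r) / curv
        z = [zi + alpha * di for zi, di in zip(z, d)]
        r_new = [ri + alpha * bi for ri, bi in zip(r, Bd)]
        if math.sqrt(dot(r_new, r_new)) < eps:    # inexact solution accepted
            return z
        beta = dot(r_new, r_new) / dot(r, r)
        d = [-rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return z

# Small positive definite example (illustrative data).
B = [[4.0, 1.0], [1.0, 3.0]]
g = [-1.0, -2.0]
p = newton_cg_direction(g, lambda d: [dot(row, d) for row in B])
print(p, dot(p, g))   # dot(p, g) < 0, so p is a descent direction
```

In this example the loop stops after a single CG step, because the residual already meets the loose tolerance $\epsilon_k$ that applies when the gradient is large.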
The main differences between the inner loop of Algorithm 7.1 and Algorithm 5.2 are
that the specific starting point $z_0 = 0$ is used; the use of a positive tolerance $\epsilon_k$ allows the CG
iterations to terminate at an inexact solution; and the negative curvature test $d_j^T B_k d_j \le 0$
ensures that $p_k$ is a descent direction for $f$ at $x_k$. If negative curvature is detected on the
first inner iteration $j = 0$, the returned direction $p_k = -\nabla f_k$ is both a descent direction
and a direction of nonpositive curvature for $f$ at $x_k$.

We  can  modify  the  CG  iterations  in  Algorithm  7.1  by  introducing  preconditioning, 
in  the  manner  described  in  Chapter  5. 

Algorithm 7.1 is well suited for large problems, but it has a weakness. When the Hessian
$\nabla^2 f_k$ is nearly singular, the line search Newton-CG direction can be long and of poor quality,
requiring many function evaluations in the line search and giving only a small reduction in
the function. To alleviate this difficulty, we can try to normalize the Newton step, but good
rules for doing so are difficult to determine. They run the risk of undermining the rapid
convergence of Newton's method in the case where the pure Newton step is well scaled. It
is preferable to introduce a threshold value into the test $d_j^T B_k d_j \le 0$, but good choices of
the threshold are difficult to determine. The trust-region Newton-CG method described
below deals more effectively with this problematic situation and is therefore preferable, in
our opinion.

The line search Newton-CG method does not require explicit knowledge of the
Hessian $B_k = \nabla^2 f_k$. Rather, it requires only that we can supply Hessian-vector products
of the form $\nabla^2 f_k \, d$ for any given vector $d$. When the user cannot easily supply code to
calculate second derivatives, or where the Hessian requires too much storage, the techniques
of Chapter 8 (automatic differentiation and finite differencing) can be used to calculate these
Hessian-vector products. Methods of this type are known as Hessian-free Newton methods.

To illustrate the finite-differencing technique briefly, we use the approximation

$$\nabla^2 f_k \, d \approx \frac{\nabla f(x_k + h d) - \nabla f(x_k)}{h}, \tag{7.10}$$

for some small differencing interval $h$. It is easy to prove that the accuracy of this
approximation is $O(h)$; appropriate choices of $h$ are discussed in Chapter 8. The price we pay for
bypassing the computation of the Hessian is one new gradient evaluation per CG iteration.
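A minimal Python sketch of the forward-difference product (7.10) is given below. The function name `hessian_vector_fd`, the default interval `h = 1e-6`, the optional argument `g0`, and the test objective are all illustrative assumptions; Chapter 8 discusses how $h$ should actually be chosen.

```python
def hessian_vector_fd(grad_f, x, d, h=1e-6, g0=None):
    """Approximate (Hessian of f at x) times d by the forward difference (7.10).

    grad_f : callable returning the gradient of f as a list
    g0     : gradient at x, if already available (saves one evaluation)
    """
    if g0 is None:
        g0 = grad_f(x)
    x_shift = [xi + h * di for xi, di in zip(x, d)]
    g1 = grad_f(x_shift)
    return [(a - b) / h for a, b in zip(g1, g0)]

# Illustrative objective (an assumption): f(x) = x0^4 + x0*x1 + x1^2,
# whose Hessian at x = (1, 2) is [[12, 1], [1, 2]].
def grad_f(x):
    return [4.0 * x[0] ** 3 + x[1], x[0] + 2.0 * x[1]]

x = [1.0, 2.0]
d = [1.0, -1.0]
approx = hessian_vector_fd(grad_f, x, d)
print(approx)   # close to the exact product [11, -1], with O(h) error
```

Within a CG loop, the gradient at $x_k$ is already known, so passing it as `g0` leaves exactly one new gradient evaluation per product, matching the cost noted above.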


TRUST-REGION  NEWTON-CG  METHOD 

In  Chapter  4,  we  discussed  approaches  for  finding  an  approximate  solution  of  the 
trust-region  subproblem  (4.3)  that  produce  improvements  on  the  Cauchy  point.  Here  we 
define  a  modified  CG  algorithm  for  solving  the  subproblem  with  these  properties.  This 
algorithm,  due  to  Steihaug  [281],  is  specified  below  as  Algorithm  7.2.  A  complete  algorithm 
for minimizing $f$ is obtained by using Algorithm 7.2 to generate the step $p_k$ required by
Algorithm 4.1 of Chapter 4, for some choice of tolerance $\epsilon_k$ at each iteration.

We use notation similar to (7.9) to define the trust-region subproblem for which
Steihaug's method finds an approximate solution:

$$\min_{p \in \mathbb{R}^n} \; m_k(p) \stackrel{\rm def}{=} f_k + (\nabla f_k)^T p + \tfrac{1}{2} p^T B_k p \quad \text{subject to } \|p\| \le \Delta_k, \tag{7.11}$$

where $B_k = \nabla^2 f_k$. As in Algorithm 7.1, we use $d_j$ to denote the search directions of this
modified CG iteration and $z_j$ to denote the sequence of iterates that it generates.

Algorithm 7.2 (CG-Steihaug).

Given tolerance $\epsilon_k > 0$;
Set $z_0 = 0$, $r_0 = \nabla f_k$, $d_0 = -r_0 = -\nabla f_k$;
if $\|r_0\| < \epsilon_k$
    return $p_k = z_0 = 0$;
for $j = 0, 1, 2, \ldots$
    if $d_j^T B_k d_j \le 0$
        Find $\tau$ such that $p_k = z_j + \tau d_j$ minimizes $m_k(p_k)$ in (4.5)
            and satisfies $\|p_k\| = \Delta_k$;
        return $p_k$;
    Set $\alpha_j = r_j^T r_j / d_j^T B_k d_j$;
    Set $z_{j+1} = z_j + \alpha_j d_j$;
    if $\|z_{j+1}\| \ge \Delta_k$
        Find $\tau \ge 0$ such that $p_k = z_j + \tau d_j$ satisfies $\|p_k\| = \Delta_k$;
        return $p_k$;
    Set $r_{j+1} = r_j + \alpha_j B_k d_j$;
    if $\|r_{j+1}\| < \epsilon_k$
        return $p_k = z_{j+1}$;
    Set $\beta_{j+1} = r_{j+1}^T r_{j+1} / r_j^T r_j$;
    Set $d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$;
end (for).
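The steps of Algorithm 7.2 can be sketched in Python as shown below. Several details are assumptions made for illustration: vectors are plain lists, the Hessian enters only through a callable `hess_vec`, the cap `max_iter` is an added safeguard, and in the nonpositive-curvature branch both boundary intersections are compared through the model change $\tau\, r_j^T d_j + \tfrac{1}{2} \tau^2 d_j^T B_k d_j$, which uses the fact that $r_j = B_k z_j + \nabla f_k$ is the model gradient at $z_j$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def boundary_taus(z, d, delta):
    """Both roots tau of ||z + tau d|| = delta, as (tau_minus, tau_plus)."""
    a = dot(d, d)
    b = 2.0 * dot(z, d)
    c = dot(z, z) - delta * delta        # c <= 0 while z lies inside the region
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)

def cg_steihaug(grad, hess_vec, delta, eps, max_iter=None):
    """Approximate solution of the trust-region subproblem (7.11)."""
    n = len(grad)
    z = [0.0] * n                        # z_0 = 0
    r = list(grad)                       # r_0 = grad f_k
    d = [-ri for ri in r]                # d_0 = -r_0
    if math.sqrt(dot(r, r)) < eps:
        return z
    if max_iter is None:
        max_iter = 10 * n                # hypothetical safeguard
    for _ in range(max_iter):
        Bd = hess_vec(d)
        curv = dot(d, Bd)
        if curv <= 0.0:
            # Move to the boundary along d, choosing the root with lower model value.
            rd = dot(r, d)
            tau = min(boundary_taus(z, d, delta),
                      key=lambda t: t * rd + 0.5 * t * t * curv)
            return [zi + tau * di for zi, di in zip(z, d)]
        alpha = dot(r, r) / curv
        z_next = [zi + alpha * di for zi, di in zip(z, d)]
        if math.sqrt(dot(z_next, z_next)) >= delta:
            tau = boundary_taus(z, d, delta)[1]      # nonnegative root
            return [zi + tau * di for zi, di in zip(z, d)]
        r_next = [ri + alpha * bi for ri, bi in zip(r, Bd)]
        if math.sqrt(dot(r_next, r_next)) < eps:
            return z_next
        beta = dot(r_next, r_next) / dot(r, r)
        d = [-rn + beta * di for rn, di in zip(r_next, d)]
        z, r = z_next, r_next
    return z

# Illustrative data: positive definite B, first with a large and then a small radius.
B = [[4.0, 1.0], [1.0, 3.0]]
g = [-1.0, -2.0]
hv = lambda d: [dot(row, d) for row in B]
p_inside = cg_steihaug(g, hv, delta=10.0, eps=1e-10)    # Newton step (1/11, 7/11)
p_boundary = cg_steihaug(g, hv, delta=0.1, eps=1e-10)   # stopped at the boundary
print(p_inside, p_boundary)
```

With the large radius the trust region is inactive and the routine returns the Newton step; with the small radius the first CG step already leaves the region, so the returned step lies on the boundary along $-\nabla f_k$.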


The first if statement inside the loop stops the method if its current search direction
$d_j$ is a direction of nonpositive curvature along $B_k$, while the second if statement inside the
loop causes termination if $z_{j+1}$ violates the trust-region bound. In both cases, the method
returns the step $p_k$ obtained by intersecting the current search direction with the trust-region
boundary.

The choice of the tolerance $\epsilon_k$ at each call to Algorithm 7.2 is important in keeping the
overall cost of the trust-region Newton-CG method low. Near a well-behaved solution $x^*$,
the trust-region bound becomes inactive, and the method reduces to the inexact Newton
method analyzed in Theorems 7.1 and 7.2. Rapid convergence can be obtained in these
circumstances by choosing $\epsilon_k$ in a similar fashion to Algorithm 7.1.

The essential differences between Algorithm 5.2 and the inner loop of Algorithm 7.2
are that the latter terminates when it violates the trust-region bound $\|p\| \le \Delta_k$, when it
encounters a direction of negative curvature in $\nabla^2 f_k$, or when it satisfies a convergence
tolerance defined by a parameter $\epsilon_k$. In these respects, Algorithm 7.2 is quite similar to the
inner loop of Algorithm 7.1.

The initialization of $z_0$ to zero in Algorithm 7.2 is a crucial feature of the algorithm.
Provided $\|\nabla f_k\|_2 \ge \epsilon_k$, Algorithm 7.2 terminates at a point $p_k$ for which $m_k(p_k) \le m_k(p_k^C)$,
that is, when the reduction in model function equals or exceeds that of the Cauchy point.
To demonstrate this fact, we consider several cases. First, if $d_0^T B_k d_0 = (\nabla f_k)^T B_k \nabla f_k \le 0$,
then the condition in the first if statement is satisfied, and the algorithm returns the Cauchy
point $p = -\Delta_k (\nabla f_k) / \|\nabla f_k\|$. Otherwise, Algorithm 7.2 defines $z_1$ as follows:

$$z_1 = \alpha_0 d_0 = \frac{r_0^T r_0}{d_0^T B_k d_0} \, d_0 = -\frac{(\nabla f_k)^T \nabla f_k}{(\nabla f_k)^T B_k \nabla f_k} \, \nabla f_k.$$

If $\|z_1\| < \Delta_k$, then $z_1$ is exactly the Cauchy point. Subsequent steps of Algorithm 7.2 ensure
that the final $p_k$ satisfies $m_k(p_k) \le m_k(z_1)$. When $\|z_1\| \ge \Delta_k$, on the other hand, the second
if statement is activated, and Algorithm 7.2 terminates at the Cauchy point, proving our
claim. This property is important for global convergence: Since each step is at least as good
as the Cauchy point in reducing the model $m_k$, Algorithm 7.2 is globally convergent.

Another crucial property of the method is that each iterate $z_j$ is larger in norm than
its predecessor. This property is another consequence of the initialization $z_0 = 0$. Its main
implication is that it is acceptable to stop iterating as soon as the trust-region boundary is
reached, because no further iterates giving a lower value of the model function $m_k$ will lie
inside the trust region. We state and prove this property formally in the following theorem,
which makes use of the expanding subspace property of the conjugate gradient algorithm,
described in Theorem 5.2.


Theorem 7.3.

The sequence of vectors $\{z_j\}$ generated by Algorithm 7.2 satisfies

$$0 = \|z_0\|_2 < \cdots < \|z_j\|_2 < \|z_{j+1}\|_2 < \cdots < \|p_k\|_2 \le \Delta_k.$$

PROOF. We first show that the sequences of vectors generated by Algorithm 7.2 satisfy
$z_j^T r_j = 0$ for $j \ge 0$ and $z_j^T d_j > 0$ for $j \ge 1$.

Algorithm 7.2 computes $z_{j+1}$ recursively in terms of $z_j$; but when all the terms of this
recursion are written explicitly, we see that

$$z_j = z_0 + \sum_{i=0}^{j-1} \alpha_i d_i = \sum_{i=0}^{j-1} \alpha_i d_i,$$

since $z_0 = 0$. Multiplying by $r_j$ and applying the expanding subspace property of conjugate
gradients (see Theorem 5.2), we obtain

$$z_j^T r_j = \sum_{i=0}^{j-1} \alpha_i d_i^T r_j = 0. \tag{7.12}$$

An induction proof establishes the relation $z_j^T d_j > 0$. By applying the expanding
subspace property again, we obtain

$$z_1^T d_1 = (\alpha_0 d_0)^T (-r_1 + \beta_1 d_0) = \alpha_0 \beta_1 d_0^T d_0 > 0.$$

We now make the inductive hypothesis that $z_j^T d_j > 0$ and deduce that $z_{j+1}^T d_{j+1} > 0$. From
(7.12), we have $z_{j+1}^T r_{j+1} = 0$, and therefore

$$\begin{aligned}
z_{j+1}^T d_{j+1} &= z_{j+1}^T (-r_{j+1} + \beta_{j+1} d_j) \\
&= \beta_{j+1} z_{j+1}^T d_j \\
&= \beta_{j+1} (z_j + \alpha_j d_j)^T d_j \\
&= \beta_{j+1} z_j^T d_j + \alpha_j \beta_{j+1} d_j^T d_j.
\end{aligned}$$

Because of the inductive hypothesis and positivity of $\beta_{j+1}$ and $\alpha_j$, the last expression is
positive.

We now prove the theorem. If Algorithm 7.2 terminates because $d_j^T B_k d_j \le 0$ or
$\|z_{j+1}\|_2 \ge \Delta_k$, then the final point $p_k$ is chosen to make $\|p_k\|_2 = \Delta_k$, which is the
largest possible length. To cover all other possibilities in the algorithm, we must show that
$\|z_j\|_2 < \|z_{j+1}\|_2$ when $z_{j+1} = z_j + \alpha_j d_j$ and $j \ge 1$. Observe that

$$\|z_{j+1}\|_2^2 = (z_j + \alpha_j d_j)^T (z_j + \alpha_j d_j) = \|z_j\|_2^2 + 2 \alpha_j z_j^T d_j + \alpha_j^2 \|d_j\|_2^2.$$

It follows from this expression and our intermediate result that $\|z_j\|_2 < \|z_{j+1}\|_2$, so our
proof is complete. □

From this theorem we see that Algorithm 7.2 sweeps out points $z_j$ that move on some
interpolating path from $z_1$ to the final solution $p_k$, a path in which every step increases its
total distance from the start point. When $B_k = \nabla^2 f_k$ is positive definite, this path may
be compared to the path of the dogleg method: Both methods start by minimizing $m_k$
along the negative gradient direction $-\nabla f_k$ and subsequently progress toward $p_k^N$, until the
trust-region boundary intervenes. One can show that, when $B_k = \nabla^2 f_k$ is positive definite,
Algorithm 7.2 provides a decrease in the model (7.11) that is at least half as good as the
optimal decrease [320].

PRECONDITIONING THE TRUST-REGION NEWTON-CG METHOD

As discussed in Chapter 5, preconditioning can be used to accelerate the CG iteration.
Preconditioning techniques are based on finding a nonsingular matrix $D$ such that the
eigenvalues of $D^{-T} \nabla^2 f_k D^{-1}$ have a more favorable distribution. By generalizing Theorem 7.3,
we can show that the iterates $z_j$ generated by a preconditioned variant of Algorithm 7.2 will
grow monotonically in the weighted norm $\|D \, \cdot\|$. To be consistent, we should redefine the
trust-region subproblem in terms of the same norm, as follows:

$$\min_{p \in \mathbb{R}^n} \; m_k(p) \stackrel{\rm def}{=} f_k + \nabla f_k^T p + \tfrac{1}{2} p^T B_k p \quad \text{subject to } \|Dp\| \le \Delta_k. \tag{7.13}$$

Making the change of variables $\hat{p} = Dp$ and defining

$$\hat{g}_k = D^{-T} \nabla f_k, \qquad \hat{B}_k = D^{-T} (\nabla^2 f_k) D^{-1},$$

we can write (7.13) as

$$\min_{\hat{p} \in \mathbb{R}^n} \; f_k + \hat{g}_k^T \hat{p} + \tfrac{1}{2} \hat{p}^T \hat{B}_k \hat{p} \quad \text{subject to } \|\hat{p}\| \le \Delta_k,$$

which has exactly the form of (7.11). We can apply Algorithm 7.2 without any modification to
this subproblem, which is equivalent to applying a preconditioned version of Algorithm 7.2
to the problem (7.13).

Many preconditioners can be used within this framework; we discuss some of them
in Chapter 5. Of particular interest is incomplete Cholesky factorization, which has proved
useful in a wide range of optimization problems. The incomplete Cholesky factorization of
a positive definite matrix $B$ finds a lower triangular matrix $L$ such that

$$B = LL^T - R,$$

where the amount of fill-in in $L$ is restricted in some way. (For instance, it is constrained
to have the same sparsity structure as the lower triangular part of $B$ or is allowed to have a
number of nonzero entries similar to that in $B$.) The matrix $R$ accounts for the inexactness
in the approximate factorization. The situation is complicated somewhat by the possible
indefiniteness of the Hessian $\nabla^2 f_k$; we must be able to handle this indefiniteness as well as
maintain the sparsity. The following algorithm combines incomplete Cholesky and a form
of modified Cholesky to define a preconditioner for the trust-region Newton-CG approach.

Algorithm 7.3 (Inexact Modified Cholesky).

Compute $T = \mathrm{diag}(\|Be_1\|, \|Be_2\|, \ldots, \|Be_n\|)$, where $e_i$ is the
    $i$th coordinate vector;
Set $\bar{B} \leftarrow T^{-1/2} B T^{-1/2}$; Set $\beta \leftarrow \|\bar{B}\|$;
(compute a shift to ensure positive definiteness)
if $\min_i b_{ii} > 0$
    $\alpha_0 \leftarrow 0$
else
    $\alpha_0 \leftarrow \beta/2$;
for $k = 0, 1, 2, \ldots$
    Attempt to apply incomplete Cholesky algorithm to obtain
        $LL^T = \bar{B} + \alpha_k I$;
    if the factorization is completed successfully
        stop and return $L$;
    else
        $\alpha_{k+1} \leftarrow \max(2\alpha_k, \beta/2)$;
end (for)

We can then set the preconditioner to be $D = L^T$, where $L$ is the lower triangular matrix
output from Algorithm 7.3. A trust-region Newton-CG method using this preconditioner
is implemented in the LANCELOT [72] and TRON [192] codes.
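To make the shift loop of Algorithm 7.3 concrete, here is a small dense Python sketch. It rests on several assumptions that the text leaves open: the incomplete factorization is the zero-fill variant that keeps only entries present in the lower triangle of the matrix, the norm $\|\bar{B}\|$ is taken to be the Frobenius norm, $B$ has no zero columns, and the diagonal test is applied to the scaled matrix (the signs agree with those of $B$ because the scaling is positive). A practical code would use sparse storage.

```python
import math

def incomplete_cholesky(A):
    """Zero-fill incomplete Cholesky; returns L, or None on a nonpositive pivot."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            return None                      # factorization breaks down
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            if A[i][j] != 0.0:               # keep only the sparsity pattern of A
                s = sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def inexact_modified_cholesky(B):
    """Sketch of Algorithm 7.3; returns (L, alpha) for the scaled, shifted matrix."""
    n = len(B)
    # T = diag(||B e_i||): Euclidean norms of the columns of B.
    t = [math.sqrt(sum(B[i][j] ** 2 for i in range(n))) for j in range(n)]
    # Scaled matrix T^{-1/2} B T^{-1/2}.
    Bs = [[B[i][j] / math.sqrt(t[i] * t[j]) for j in range(n)] for i in range(n)]
    beta = math.sqrt(sum(v * v for row in Bs for v in row))   # Frobenius norm (assumption)
    alpha = 0.0 if min(Bs[i][i] for i in range(n)) > 0.0 else beta / 2.0
    while True:
        shifted = [[Bs[i][j] + (alpha if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        L = incomplete_cholesky(shifted)
        if L is not None:
            return L, alpha
        alpha = max(2.0 * alpha, beta / 2.0)  # increase the shift and retry

# A positive definite matrix needs no shift; an indefinite one does.
L_pd, shift_pd = inexact_modified_cholesky([[2.0, -1.0], [-1.0, 2.0]])
L_ind, shift_ind = inexact_modified_cholesky([[1.0, 2.0], [2.0, 1.0]])
print(shift_pd, shift_ind)
```

The shift at least doubles after every failed attempt, so the shifted matrix eventually becomes strongly diagonally dominant and the factorization goes through.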

TRUST-REGION  NEWTON-LANCZOS  METHOD 

A limitation of Algorithm 7.2 is that it accepts any direction of negative curvature, even
when this direction gives an insignificant reduction in the model. Consider, for example,
the case where the subproblem (7.11) is

$$\min_p \; m(p) = 10^{-3} p_1 - 10^{-4} p_1^2 - p_2^2 \quad \text{subject to } \|p\| \le 1,$$

where subscripts indicate elements of the vector $p$. The steepest descent direction at $p = 0$ is
$(-10^{-3}, 0)^T$, which is a direction of negative curvature for the model. Algorithm 7.2 would
follow this direction to the boundary of the trust region, yielding a reduction in model
function $m$ of about $10^{-3}$. A step along $e_2$, also a direction of negative curvature, would
yield a much greater reduction of 1.
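The two reductions quoted in this example can be checked by evaluating the model at the two boundary points. The short Python sketch below does so; the names `m`, `step_sd`, and `step_e2` are illustrative.

```python
def m(p):
    """Model of the example: m(p) = 1e-3*p1 - 1e-4*p1^2 - p2^2, with m(0) = 0."""
    return 1e-3 * p[0] - 1e-4 * p[0] ** 2 - p[1] ** 2

step_sd = [-1.0, 0.0]   # steepest descent direction followed to the boundary ||p|| = 1
step_e2 = [0.0, 1.0]    # unit step along the second coordinate direction

print(-m(step_sd))      # model reduction of about 1.1e-3
print(-m(step_e2))      # model reduction of 1
```

The direction picked up by Algorithm 7.2 therefore achieves roughly one thousandth of the reduction available along $e_2$.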

Several remedies have been proposed. We have seen in Chapter 4 that when the
Hessian $\nabla^2 f_k$ contains negative eigenvalues, the search direction should have a significant
component along the eigenvector corresponding to the most negative eigenvalue of $\nabla^2 f_k$.
This feature would allow the algorithm to move away rapidly from stationary points that
are not minimizers. One way to achieve this is to compute a nearly exact solution of
the trust-region subproblem (7.11) using the techniques described in Section 4.3. This
approach requires the solution of a few linear systems with coefficient matrices of the form
$B_k + \lambda I$. Although this approach is perhaps too expensive in the large-scale case, it generates
productive search directions in all cases.

A more practical alternative is to use the Lanczos method (see, for example, [136])
rather than the CG method to solve the linear system $B_k p = -\nabla f_k$. The Lanczos method
can be seen as a generalization of the CG method that is applicable to indefinite systems, and
we can use it to continue the CG process while gathering negative curvature information.

After $j$ steps, the Lanczos method generates an $n \times j$ matrix $Q_j$ with orthogonal
columns that span the Krylov subspace (5.15) generated by this method. This matrix has
the property that $Q_j^T B Q_j = T_j$, where $T_j$ is tridiagonal. We can take advantage of
this tridiagonal structure and seek to find an approximate solution of the trust-region
subproblem in the range of the basis $Q_j$. To do so, we solve the problem

$$\min_{w \in \mathbb{R}^j} \; f_k + e_1^T Q_j^T (\nabla f_k) \, e_1^T w + \tfrac{1}{2} w^T T_j w \quad \text{subject to } \|w\| \le \Delta_k, \tag{7.14}$$

where $e_1 = (1, 0, 0, \ldots, 0)^T$, and we define the approximate solution of the trust-region
subproblem as $p_k = Q_j w$. Since $T_j$ is tridiagonal, problem (7.14) can be solved by factoring
the system $T_j + \lambda I$ and following the (nearly) exact approach of Section 4.3.

The Lanczos iteration may be terminated, as in the Newton-CG methods, by a test of
the form (7.3). Preconditioning can also be incorporated to accelerate the convergence of
the Lanczos iteration. The additional robustness in this trust-region algorithm comes at the
cost of a more expensive solution of the subproblem than in the Newton-CG approach. A
sophisticated implementation of the Newton-Lanczos approach is available in
the GLTR package [145].


7.2  LIMITED-MEMORY  QUASI-NEWTON  METHODS 


Limited-memory quasi-Newton methods are useful for solving large problems whose Hessian
matrices cannot be computed at a reasonable cost or are not sparse. These methods
maintain simple and compact approximations of Hessian matrices: Instead of storing fully
dense $n \times n$ approximations, they save only a few vectors of length $n$ that represent the
approximations implicitly. Despite these modest storage requirements, they often yield an
acceptable (albeit linear) rate of convergence. Various limited-memory methods have been
proposed; we focus mainly on an algorithm known as L-BFGS, which, as its name suggests,
is based on the BFGS updating formula. The main idea of this method is to use curvature
information from only the most recent iterations to construct the Hessian approximation.
Curvature information from earlier iterations, which is less likely to be relevant to the actual
behavior of the Hessian at the current iteration, is discarded in the interest of saving
storage.

Following our discussion of L-BFGS and its convergence behavior, we discuss its
relationship to the nonlinear conjugate gradient methods of Chapter 5. We then discuss
implementations of limited-memory schemes that make use of a compact representation of
approximate Hessian information. These techniques can be applied not only to L-BFGS but
also to limited-memory versions of other quasi-Newton procedures such as SR1. Finally,
we discuss quasi-Newton updating schemes that impose a particular sparsity pattern on the
approximate Hessian.


LIMITED-MEMORY  BFGS 

We begin our description of the L-BFGS method by recalling its parent, the BFGS
method, which was described in Algorithm 6.1. Each step of the BFGS method has the form

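\[ x_{k+1} = x_k - \alpha_k H_k \nabla f_k, \tag{7.15} \]

where $\alpha_k$ is the step length and $H_k$ is updated at every iteration by means of the formula

\[ H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T \tag{7.16} \]

(see (6.17)), where

\[ \rho_k = \frac{1}{y_k^T s_k}, \qquad V_k = I - \rho_k y_k s_k^T, \tag{7.17} \]

and

\[ s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k. \tag{7.18} \]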

Since the inverse Hessian approximation $H_k$ will generally be dense, the cost of storing
and manipulating it is prohibitive when the number of variables is large. To circumvent this
problem, we store a modified version of $H_k$ implicitly, by storing a certain number (say, $m$)
of the vector pairs $\{s_i, y_i\}$ used in the formulas (7.16)-(7.18). The product $H_k \nabla f_k$ can be
obtained by performing a sequence of inner products and vector summations involving
$\nabla f_k$ and the pairs $\{s_i, y_i\}$. After the new iterate is computed, the oldest vector pair in the set
of pairs $\{s_i, y_i\}$ is replaced by the new pair $\{s_k, y_k\}$ obtained from the current step (7.18).
In this way, the set of vector pairs includes curvature information from the $m$ most recent
iterations. Practical experience has shown that modest values of $m$ (between 3 and 20, say)
often produce satisfactory results.

We now describe the updating process in a little more detail. At iteration $k$, the current
iterate is $x_k$ and the set of vector pairs is given by $\{s_i, y_i\}$ for $i = k-m, \dots, k-1$. We first
choose some initial Hessian approximation $H_k^0$ (in contrast to the standard BFGS iteration,
this initial approximation is allowed to vary from iteration to iteration) and find by repeated
application of the formula (7.16) that the L-BFGS approximation $H_k$ satisfies the following
formula:

\[
\begin{aligned}
H_k ={}& \left( V_{k-1}^T \cdots V_{k-m}^T \right) H_k^0 \left( V_{k-m} \cdots V_{k-1} \right) \\
& + \rho_{k-m} \left( V_{k-1}^T \cdots V_{k-m+1}^T \right) s_{k-m} s_{k-m}^T \left( V_{k-m+1} \cdots V_{k-1} \right) \\
& + \rho_{k-m+1} \left( V_{k-1}^T \cdots V_{k-m+2}^T \right) s_{k-m+1} s_{k-m+1}^T \left( V_{k-m+2} \cdots V_{k-1} \right) \\
& + \cdots \\
& + \rho_{k-1} s_{k-1} s_{k-1}^T.
\end{aligned}
\tag{7.19}
\]

From this expression we can derive a recursive procedure to compute the product $H_k \nabla f_k$
efficiently.

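Algorithm 7.4 (L-BFGS two-loop recursion).

  $q \leftarrow \nabla f_k$;
  for $i = k-1, k-2, \dots, k-m$
      $\alpha_i \leftarrow \rho_i s_i^T q$;
      $q \leftarrow q - \alpha_i y_i$;
  end (for)
  $r \leftarrow H_k^0 q$;
  for $i = k-m, k-m+1, \dots, k-1$
      $\beta \leftarrow \rho_i y_i^T r$;
      $r \leftarrow r + s_i (\alpha_i - \beta)$;
  end (for)
  stop with result $H_k \nabla f_k = r$.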
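The two-loop recursion translates almost line by line into array code. The following is a minimal sketch, assuming NumPy, the choice $H_k^0 = \gamma I$, and correction pairs held in Python lists ordered from oldest to newest; the names `two_loop` and `dense_inverse_bfgs` are illustrative only. The dense routine applies (7.16) repeatedly and serves purely as a reference for checking the result on small synthetic data.

```python
import numpy as np

def two_loop(grad, s_list, y_list, gamma=1.0):
    """Return H_k @ grad for H_k^0 = gamma * I; pairs are ordered oldest first."""
    q = np.array(grad, dtype=float)            # copy, so the caller's gradient is untouched
    rho = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):     # i = k-1, ..., k-m
        alpha[i] = rho[i] * np.dot(s_list[i], q)
        q -= alpha[i] * y_list[i]
    r = gamma * q                              # r <- H_k^0 q
    for i in range(len(s_list)):               # i = k-m, ..., k-1
        beta = rho[i] * np.dot(y_list[i], r)
        r += s_list[i] * (alpha[i] - beta)
    return r

def dense_inverse_bfgs(s_list, y_list, gamma=1.0):
    """Reference only: build H_k densely by repeated use of (7.16)."""
    n = s_list[0].size
    H = gamma * np.eye(n)
    for s, y in zip(s_list, y_list):
        rho = 1.0 / np.dot(y, s)
        V = np.eye(n) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)
    return H

# Synthetic data: y_i = A s_i with A positive definite guarantees s_i^T y_i > 0.
rng = np.random.default_rng(0)
n, m = 6, 3
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
S = [rng.standard_normal(n) for _ in range(m)]
Y = [A @ s for s in S]
g = rng.standard_normal(n)
r_fast = two_loop(g, S, Y, gamma=0.7)
r_dense = dense_inverse_bfgs(S, Y, gamma=0.7) @ g
```

The recursion touches only the $2m$ stored vectors, whereas the reference routine forms an $n \times n$ matrix and is practical only for small $n$.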

Without considering the multiplication $H_k^0 q$, the two-loop recursion scheme requires
$4mn$ multiplications; if $H_k^0$ is diagonal, then $n$ additional multiplications are needed. Apart
from being inexpensive, this recursion has the advantage that the multiplication by the
initial matrix $H_k^0$ is isolated from the rest of the computations, allowing this matrix to be
chosen freely and to vary between iterations. We may even use an implicit choice of $H_k^0$ by
defining some initial approximation $B_k^0$ to the Hessian (not its inverse) and obtaining $r$ by
solving the system $B_k^0 r = q$.
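The count of $4mn$ multiplications can be checked by tallying the vector operations in Algorithm 7.4. The sketch below ignores the $O(m)$ scalar multiplications by $\rho_i$ and counts one multiplication per vector component.

```latex
\begin{align*}
\text{first loop, each } i: &\quad \underbrace{s_i^T q}_{n} \;+\; \underbrace{\alpha_i y_i}_{n} \;=\; 2n, \\
\text{second loop, each } i: &\quad \underbrace{y_i^T r}_{n} \;+\; \underbrace{s_i(\alpha_i - \beta)}_{n} \;=\; 2n, \\
\text{total over } m \text{ pairs}: &\quad m(2n) + m(2n) \;=\; 4mn .
\end{align*}
```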

A method for choosing $H_k^0$ that has proved effective in practice is to set $H_k^0 = \gamma_k I$,
where

\[ \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}. \tag{7.20} \]


As discussed in Chapter 6, $\gamma_k$ is the scaling factor that attempts to estimate the size of the
true Hessian matrix along the most recent search direction (see (6.21)). This choice helps
to ensure that the search direction $p_k$ is well scaled, and as a result the step length $\alpha_k = 1$ is
accepted in most iterations. As discussed in Chapter 6, it is important that the line search be
based on the Wolfe conditions (3.6) or strong Wolfe conditions (3.7), so that BFGS updating
is stable.

The  limited-memory  BFGS  algorithm  can  be  stated  formally  as  follows. 

Algorithm 7.5 (L-BFGS).

  Choose starting point $x_0$, integer $m > 0$;
  $k \leftarrow 0$;
  repeat
      Choose $H_k^0$ (for example, by using (7.20));
      Compute $p_k \leftarrow -H_k \nabla f_k$ from Algorithm 7.4;
      Compute $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ is chosen to
          satisfy the Wolfe conditions;
      if $k > m$
          Discard the vector pair $\{s_{k-m}, y_{k-m}\}$ from storage;
      Compute and save $s_k \leftarrow x_{k+1} - x_k$, $y_k \leftarrow \nabla f_{k+1} - \nabla f_k$;
      $k \leftarrow k + 1$;
  until convergence.
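A compact driver along the lines of Algorithm 7.5 might look as follows. This is a sketch under stated assumptions rather than a reference implementation: it uses NumPy, takes $H_k^0 = \gamma_k I$ from (7.20), and relies on a hypothetical helper `wolfe_step` that finds a step satisfying the (weak) Wolfe conditions by simple bisection. The test that skips pairs with $s_k^T y_k \le 0$ is an extra safeguard that is not part of the algorithm as stated.

```python
import numpy as np

def two_loop(g, S, Y, gamma):
    """Algorithm 7.4 with H_k^0 = gamma * I (pairs ordered oldest first)."""
    q = np.array(g, dtype=float)
    rho = [1.0 / np.dot(y, s) for s, y in zip(S, Y)]
    alpha = [0.0] * len(S)
    for i in reversed(range(len(S))):
        alpha[i] = rho[i] * np.dot(S[i], q)
        q -= alpha[i] * Y[i]
    r = gamma * q
    for i in range(len(S)):
        r += S[i] * (alpha[i] - rho[i] * np.dot(Y[i], r))
    return r

def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_tries=60):
    """Hypothetical helper: bisection search for a step satisfying the weak Wolfe conditions."""
    lo, hi, a = 0.0, np.inf, 1.0                     # always try the unit step first
    f0, d0 = f(x), np.dot(grad(x), p)
    for _ in range(max_tries):
        if f(x + a * p) > f0 + c1 * a * d0:          # sufficient decrease violated: shrink
            hi = a
        elif np.dot(grad(x + a * p), p) < c2 * d0:   # curvature condition violated: grow
            lo = a
        else:
            break
        a = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * a
    return a

def lbfgs(f, grad, x0, m=5, tol=1e-5, max_iter=500):
    x = np.array(x0, dtype=float)
    g = grad(x)
    S, Y = [], []
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        gamma = np.dot(S[-1], Y[-1]) / np.dot(Y[-1], Y[-1]) if S else 1.0   # (7.20)
        p = -two_loop(g, S, Y, gamma)
        x_new = x + wolfe_step(f, grad, x, p) * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if np.dot(s, y) > 0:                         # safeguard (assumption, not in Algorithm 7.5)
            if len(S) == m:                          # storage full: discard the oldest pair
                S.pop(0)
                Y.pop(0)
            S.append(s)
            Y.append(y)
        x, g = x_new, g_new
    return x

# Example: a strictly convex quadratic whose minimizer is A^{-1} b.
A = np.diag(np.arange(1.0, 11.0))
b = np.ones(10)
x_star = lbfgs(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b, np.zeros(10))
```

Discarding the oldest pair once $m$ pairs are stored keeps exactly the $m$ most recent pairs, which is the same policy as the test on $k$ in Algorithm 7.5.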

The strategy of keeping the $m$ most recent correction pairs $\{s_i, y_i\}$ works well in
practice; indeed no other strategy has yet proved to be consistently better. During its first
$m - 1$ iterations, Algorithm 7.5 is equivalent to the BFGS algorithm of Chapter 6 if the
initial matrix $H_0$ is the same in both methods, and if L-BFGS chooses $H_k^0 = H_0$ at each
iteration.

Table 7.1 presents results illustrating the behavior of Algorithm 7.5 for various levels
of memory $m$. It gives the number of function and gradient evaluations (nfg) and the total
CPU time. The test problems are taken from the CUTE collection [35], the number of
variables is indicated by $n$, and the termination criterion $\|\nabla f_k\| < 10^{-5}$ is used. The table
shows that the algorithm tends to be less robust when $m$ is small. As the amount of storage
increases, the number of function evaluations tends to decrease; but since the cost of each
iteration increases with the amount of storage, the best CPU time is often obtained for small
values of $m$. Clearly, the optimal choice of $m$ is problem dependent.

Table 7.1  Performance of Algorithm 7.5.

                      L-BFGS          L-BFGS          L-BFGS          L-BFGS
                      m = 3           m = 5           m = 17          m = 29
Problem       n       nfg    time     nfg    time     nfg    time     nfg    time
DIXMAANL    1500      146    16.5     134    17.4     120    28.2     125    44.4
EIGENALS     110      821    21.5     569    15.7     363    16.2     168    12.5
FREUROTH    1000     >999      --    >999      --      69     8.1      38     6.3
TRIDIA      1000      876    46.6     611    41.4     531    84.6     462   127.1

Because some rival algorithms are inefficient, Algorithm 7.5 is often the approach of
choice for large problems in which the true Hessian is not sparse. In particular, a Newton
method in which the exact Hessian is computed and factorized is not practical in such
circumstances.  The  L-BFGS  approach  may  also  outperform  Hessian-free  Newton  methods 
such  as  Newton-CG  approaches,  in  which  Hessian-vector  products  are  calculated  by  finite 
differences  or  automatic  differentiation.  The  main  weakness  of  the  L-BFGS  method  is  that  it 
converges  slowly  on  ill-conditioned  problems — specifically,  on  problems  where  the  Hessian 
matrix  contains  a  wide  distribution  of  eigenvalues.  On  certain  applications,  the  nonlinear 
conjugate  gradient  methods  discussed  in  Chapter  5  are  competitive  with  limited-memory 
quasi-Newton  methods. 


RELATIONSHIP  WITH  CONJUGATE  GRADIENT  METHODS 

Limited-memory  methods  evolved  as  an  attempt  to  improve  nonlinear  conjugate 
gradient  methods,  and  early  implementations  resembled  conjugate  gradient  methods  more 
than  quasi-Newton  methods.  The  relationship  between  the  two  classes  is  the  basis  of  a 
memoryless  BFGS  iteration,  which  we  now  outline. 

We start by considering the Hestenes-Stiefel form of the nonlinear conjugate gradient
method (5.46). Recalling that $s_k = \alpha_k p_k$, we have that the search direction for this method
is given by

\[ p_{k+1} = -\nabla f_{k+1} + \frac{\nabla f_{k+1}^T y_k}{y_k^T p_k} \, p_k
           = -\left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) \nabla f_{k+1}
           \equiv -\hat{H}_{k+1} \nabla f_{k+1}. \tag{7.21} \]


This formula resembles a quasi-Newton iteration, but the matrix $\hat{H}_{k+1}$ is neither symmetric
nor positive definite. We could symmetrize it as $\hat{H}_{k+1}^T \hat{H}_{k+1}$, but this matrix does not satisfy
the secant equation $\hat{H}_{k+1} y_k = s_k$ and is, in any case, singular. An iteration matrix that is
symmetric, positive definite, and satisfies the secant equation is given by

\[ H_{k+1} = \left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) \left( I - \frac{y_k s_k^T}{y_k^T s_k} \right) + \frac{s_k s_k^T}{y_k^T s_k}. \tag{7.22} \]


This matrix is exactly the one obtained by applying a single BFGS update (7.16) to the identity
matrix. Hence, an algorithm whose search direction is given by $p_{k+1} = -H_{k+1} \nabla f_{k+1}$, with
$H_{k+1}$ defined by (7.22), can be thought of as a "memoryless" BFGS method, in which the
previous Hessian approximation is always reset to the identity matrix before updating it and
where only the most recent correction pair $(s_k, y_k)$ is kept at every iteration. Alternatively,
we can view the method as a variant of Algorithm 7.5 in which $m = 1$ and $H_k^0 = I$ at each
iteration.

A more direct connection with conjugate gradient methods can be seen if we consider
the memoryless BFGS formula (7.22) in conjunction with an exact line search, for which
$\nabla f_{k+1}^T p_k = 0$ for all $k$. We then obtain

\[ p_{k+1} = -H_{k+1} \nabla f_{k+1} = -\nabla f_{k+1} + \frac{\nabla f_{k+1}^T y_k}{y_k^T p_k} \, p_k, \tag{7.23} \]

which is none other than the Hestenes-Stiefel conjugate gradient method. Moreover, it is
easy to verify that when $\nabla f_{k+1}^T p_k = 0$, the Hestenes-Stiefel formula reduces to the
Polak-Ribière formula (5.44). Even though the assumption of exact line searches is unrealistic,
it is intriguing that the BFGS formula is related in this way to the Polak-Ribière and
Hestenes-Stiefel methods.
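The equivalence in (7.23) is easy to confirm numerically. The sketch below, which assumes NumPy and uses synthetic vectors rather than iterates from an actual line search, builds a gradient that is orthogonal to $p_k$ (as an exact line search would produce) and compares the memoryless BFGS direction from (7.22) with the Hestenes-Stiefel direction; the function names are illustrative.

```python
import numpy as np

def memoryless_bfgs_direction(g_new, s, y):
    """p_{k+1} = -H_{k+1} g_{k+1} with H_{k+1} from (7.22), without forming the matrix."""
    rho = 1.0 / np.dot(y, s)
    u = g_new - rho * np.dot(s, g_new) * y           # (I - rho y s^T) g
    Hg = u - rho * np.dot(y, u) * s + rho * np.dot(s, g_new) * s
    return -Hg

def hestenes_stiefel_direction(g_new, p, y):
    """Hestenes-Stiefel direction, the first expression in (7.21)."""
    beta = np.dot(g_new, y) / np.dot(y, p)
    return -g_new + beta * p

# Synthetic data imitating an exact line search: g_new is made orthogonal to p.
rng = np.random.default_rng(1)
n = 5
p = rng.standard_normal(n)
s = 0.3 * p                                          # s_k = alpha_k p_k with alpha_k = 0.3
v = rng.standard_normal(n)
g_new = v - (np.dot(v, p) / np.dot(p, p)) * p        # enforce g_new^T p = 0
y = rng.standard_normal(n)
if np.dot(y, s) < 0:                                 # keep the curvature condition y^T s > 0
    y = -y
p_bfgs = memoryless_bfgs_direction(g_new, s, y)
p_hs = hestenes_stiefel_direction(g_new, p, y)
```

Passing $y_k$ in place of the gradient gives a quick check of the secant equation, since $H_{k+1} y_k = s_k$ means the function then returns $-s_k$.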

GENERAL  LIMITED-MEMORY  UPDATING 

Limited-memory quasi-Newton approximations are useful in a variety of optimization
methods. L-BFGS, Algorithm 7.5, is a line search method for unconstrained optimization
that (implicitly) updates an approximation $H_k$ to the inverse of the Hessian matrix.
Trust-region methods, on the other hand, require an approximation $B_k$ to the Hessian matrix,
not to its inverse. We would also like to develop limited-memory methods based on the SR1
formula, which is an attractive alternative to BFGS; see Chapter 6. In this section we consider
limited-memory updating in a general setting and show that by representing quasi-Newton
matrices in a compact (or outer product) form, we can derive efficient implementations of all
popular quasi-Newton update formulas, and their inverses. These compact representations
will also be useful in designing limited-memory methods for constrained optimization,
where approximations to the Hessian or reduced Hessian of the Lagrangian are needed; see
Chapter 18 and Chapter 19.

We  will  consider  only  limited-memory  methods  (such  as  L-BFGS)  that  continuously 
refresh  the  correction  pairs  by  removing  and  adding  information  at  each  stage.  A  different 
approach  saves  correction  pairs  until  the  available  storage  is  exhausted  and  then  discards  all 
correction  pairs  (except  perhaps  one)  and  starts  the  process  anew.  Computational  experience 
suggests  that  this  second  approach  is  less  effective  in  practice. 

Throughout this chapter we let $B_k$ denote an approximation to a Hessian matrix and
$H_k$ the approximation to the inverse. In particular, we always have that $B_k^{-1} = H_k$.

COMPACT  REPRESENTATION  OF  BFGS  UPDATING 

We now describe an approach to limited-memory updating that is based on representing
quasi-Newton matrices in outer-product form. We illustrate it for the case of a BFGS
approximation $B_k$ to the Hessian.

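Theorem 7.4.

Let $B_0$ be symmetric and positive definite, and assume that the $k$ vector pairs $\{s_i, y_i\}_{i=0}^{k-1}$
satisfy $s_i^T y_i > 0$. Let $B_k$ be obtained by applying $k$ BFGS updates with these vector pairs to $B_0$,
using the formula (6.19). We then have that

\[ B_k = B_0 - \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix}
         \begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}
         \begin{bmatrix} S_k^T B_0 \\ Y_k^T \end{bmatrix}, \tag{7.24} \]

where $S_k$ and $Y_k$ are the $n \times k$ matrices defined by

\[ S_k = [s_0, \dots, s_{k-1}], \qquad Y_k = [y_0, \dots, y_{k-1}], \tag{7.25} \]

while $L_k$ and $D_k$ are the $k \times k$ matrices

\[ (L_k)_{i,j} = \begin{cases} s_{i-1}^T y_{j-1} & \text{if } i > j, \\ 0 & \text{otherwise,} \end{cases} \tag{7.26} \]

\[ D_k = \operatorname{diag}\left[ s_0^T y_0, \dots, s_{k-1}^T y_{k-1} \right]. \tag{7.27} \]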


This result can be proved by induction. We note that the conditions $s_i^T y_i > 0$,
$i = 0, 1, \dots, k-1$, ensure that the middle matrix in (7.24) is nonsingular, so that this
expression is well defined. The utility of this representation becomes apparent when we consider
limited-memory updating.
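Theorem 7.4 can be checked numerically on small dense problems. The following sketch assumes NumPy and synthetic correction pairs stored as the columns of `S` and `Y`; it compares the result of $k$ recursive BFGS updates with the compact form (7.24). The function names are illustrative.

```python
import numpy as np

def bfgs_recursive(B0, S, Y):
    """Apply the direct BFGS update (6.19) once per column pair of S and Y."""
    B = B0.copy()
    for s, y in zip(S.T, Y.T):
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / np.dot(s, Bs) + np.outer(y, y) / np.dot(y, s)
    return B

def bfgs_compact(B0, S, Y):
    """Evaluate the compact representation (7.24) for symmetric B0."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)                   # (L)_{ij} = s_{i-1}^T y_{j-1} for i > j
    D = np.diag(np.diag(StY))                # diag[s_0^T y_0, ..., s_{k-1}^T y_{k-1}]
    B0S = B0 @ S
    W = np.hstack([B0S, Y])                  # n x 2k matrix [B0 S, Y]
    M = np.block([[S.T @ B0S, L], [L.T, -D]])
    return B0 - W @ np.linalg.solve(M, W.T)

# Synthetic data: Y = A S with A positive definite gives s_i^T y_i > 0.
rng = np.random.default_rng(3)
n, k = 7, 4
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
H = rng.standard_normal((n, n))
B0 = H @ H.T + np.eye(n)                     # symmetric positive definite starting matrix
S = rng.standard_normal((n, k))
Y = A @ S
B_rec = bfgs_recursive(B0, S, Y)
B_cmp = bfgs_compact(B0, S, Y)
```

Here `W.T` equals the stacked matrix with blocks $S_k^T B_0$ and $Y_k^T$ because $B_0$ is symmetric.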

As in the L-BFGS algorithm, we keep the $m$ most recent correction pairs $\{s_i, y_i\}$ and
refresh this set at every iteration by removing the oldest pair and adding a newly generated
pair. During the first $m$ iterations, the update procedure described in Theorem 7.4 can be
used without modification, except that usually we make the specific choice $B_k^0 = \delta_k I$ for
the basic matrix, where $\delta_k = 1/\gamma_k$ and $\gamma_k$ is defined by (7.20).

At subsequent iterations $k > m$, the update procedure needs to be modified slightly to
reflect the changing nature of the set of vector pairs $\{s_i, y_i\}$ for $i = k-m, k-m+1, \dots, k-1$.
Defining the $n \times m$ matrices $S_k$ and $Y_k$ by

\[ S_k = [s_{k-m}, \dots, s_{k-1}], \qquad Y_k = [y_{k-m}, \dots, y_{k-1}], \tag{7.28} \]

we find that the matrix $B_k$ resulting from $m$ updates to the basic matrix $B_k^0 = \delta_k I$ is given
by

\[ B_k = \delta_k I - \begin{bmatrix} \delta_k S_k & Y_k \end{bmatrix}
         \begin{bmatrix} \delta_k S_k^T S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}
         \begin{bmatrix} \delta_k S_k^T \\ Y_k^T \end{bmatrix}, \tag{7.29} \]

where $L_k$ and $D_k$ are now the $m \times m$ matrices defined by

\[ (L_k)_{i,j} = \begin{cases} (s_{k-m-1+i})^T (y_{k-m-1+j}) & \text{if } i > j, \\ 0 & \text{otherwise,} \end{cases} \]

\[ D_k = \operatorname{diag}\left[ s_{k-m}^T y_{k-m}, \dots, s_{k-1}^T y_{k-1} \right]. \]

Figure 7.1  Compact (or outer product) representation of $B_k$ in (7.29).


After the new iterate $x_{k+1}$ is generated, we obtain $S_{k+1}$ by deleting $s_{k-m}$ from $S_k$ and adding
the new displacement $s_k$, and we update $Y_{k+1}$ in a similar fashion. The new matrices $L_{k+1}$
and $D_{k+1}$ are obtained in an analogous way.

Since the middle matrix in (7.29) is small (of dimension $2m$), its factorization requires
a negligible amount of computation. The key idea behind the compact representation
(7.29) is that the corrections to the basic matrix can be expressed as an outer product of two
long and narrow matrices, $[\delta_k S_k \;\; Y_k]$ and its transpose, with an intervening multiplication
by a small $2m \times 2m$ matrix. See Figure 7.1 for a graphical illustration.

The limited-memory updating procedure of $B_k$ requires approximately $2mn + O(m^3)$
operations, and matrix-vector products of the form $B_k v$ can be performed at a cost of
$(4m + 1)n + O(m^2)$ multiplications. These operation counts indicate that updating and
manipulating the direct limited-memory BFGS matrix $B_k$ is quite economical when $m$ is
small.
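A product $B_k v$ can be formed from (7.29) without ever building an $n \times n$ matrix. The sketch below assumes NumPy, with `S` and `Y` holding the $m$ stored pairs as columns (oldest first) and `delta` playing the role of $\delta_k$. For simplicity it recomputes the small matrices $S_k^T S_k$ and $S_k^T Y_k$ on every call, which a practical code would update incrementally instead, and the dense loop at the end is only a reference for the comparison.

```python
import numpy as np

def compact_lbfgs_matvec(v, S, Y, delta):
    """Return B_k v with B_k given by (7.29); S and Y are n x m, oldest pair first."""
    m = S.shape[1]
    StY = S.T @ Y
    L = np.tril(StY, k=-1)                                  # strictly lower triangle of S^T Y
    D = np.diag(np.diag(StY))
    M = np.block([[delta * (S.T @ S), L], [L.T, -D]])       # small 2m x 2m middle matrix
    w = np.concatenate([delta * (S.T @ v), Y.T @ v])        # [delta S^T; Y^T] v
    z = np.linalg.solve(M, w)
    return delta * v - delta * (S @ z[:m]) - Y @ z[m:]

# Synthetic data and a dense reference built by m direct BFGS updates (6.19).
rng = np.random.default_rng(2)
n, m = 8, 3
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
S = rng.standard_normal((n, m))
Y = A @ S
delta = 1.5
v = rng.standard_normal(n)
B = delta * np.eye(n)
for s, y in zip(S.T, Y.T):
    Bs = B @ s
    B = B - np.outer(Bs, Bs) / np.dot(s, Bs) + np.outer(y, y) / np.dot(y, s)
Bv_compact = compact_lbfgs_matvec(v, S, Y, delta)
Bv_dense = B @ v
```

Apart from the small $2m \times 2m$ solve, the work consists of products with the tall matrices $S_k$ and $Y_k$, which is where the cost proportional to $mn$ comes from.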

This approximation $B_k$ can be used in a trust-region method for unconstrained
optimization or, more significantly, in methods for bound-constrained and general-constrained
optimization. The program l-bfgs-b [322] makes extensive use of compact limited-memory
approximations to solve large nonlinear optimization problems with bound constraints. In
this situation, projections of $B_k$ into subspaces defined by the constraint gradients must be
calculated repeatedly. Several codes for general-constrained optimization, including knitro
and ipopt, make use of the compact limited-memory matrix $B_k$ to approximate the Hessian
of the Lagrangian; see Section 19.3.

We  can  derive  a  formula,  similar  to  (7.24),  that  provides  a  compact  representation 
of  the  inverse  BFGS  approximation  Hk;  see  [52]  for  details.  An  implementation  of  the 
unconstrained  L-BFGS  algorithm  based  on  this  expression  requires  a  similar  amount  of 
computation  as  the  algorithm  described  in  the  previous  section. 

Compact representations can also be derived for matrices generated by the symmetric
rank-one (SR1) formula. If $k$ updates are applied to the symmetric matrix $B_0$ using the
vector pairs $\{s_i, y_i\}_{i=0}^{k-1}$ and the SR1 formula (6.24), the resulting matrix $B_k$ can be expressed
as

\[ B_k = B_0 + (Y_k - B_0 S_k) \left( D_k + L_k + L_k^T - S_k^T B_0 S_k \right)^{-1} (Y_k - B_0 S_k)^T, \tag{7.30} \]

where $S_k$, $Y_k$, $D_k$, and $L_k$ are as defined in (7.25), (7.26), and (7.27). Since the SR1 method
is self-dual, the inverse formula $H_k$ can be obtained simply by replacing $B$, $s$, and $y$ by $H$,
$y$, and $s$, respectively. Limited-memory SR1 methods can be derived in the same way as the
BFGS method. We replace $B_0$ with the basic matrix $B_k^0$ at the $k$th iteration, and we redefine
$S_k$ and $Y_k$ to contain the $m$ most recent corrections, as in (7.28). We note, however, that
limited-memory SR1 updating is sometimes not as effective as L-BFGS updating because it
may not produce positive definite approximations near a solution.
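The SR1 representation (7.30) can be checked in the same way as the BFGS one. This sketch assumes NumPy and synthetic data chosen so that every SR1 denominator $(y_i - B_i s_i)^T s_i$ is nonzero; it makes no attempt to handle the breakdown cases that a practical SR1 code must guard against, and the function names are illustrative.

```python
import numpy as np

def sr1_recursive(B0, S, Y):
    """Apply the SR1 update (6.24) once per column pair of S and Y."""
    B = B0.copy()
    for s, y in zip(S.T, Y.T):
        r = y - B @ s
        B = B + np.outer(r, r) / np.dot(r, s)
    return B

def sr1_compact(B0, S, Y):
    """Evaluate the compact SR1 representation (7.30)."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)                   # strictly lower triangle, as in (7.26)
    D = np.diag(np.diag(StY))                # diagonal part, as in (7.27)
    R = Y - B0 @ S                           # n x k matrix Y_k - B_0 S_k
    M = D + L + L.T - S.T @ B0 @ S           # small k x k middle matrix
    return B0 + R @ np.linalg.solve(M, R.T)

# Synthetic data: Y = A S with A - B0 positive definite keeps the updates well defined.
rng = np.random.default_rng(4)
n, k = 6, 3
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
B0 = np.eye(n)
S = rng.standard_normal((n, k))
Y = A @ S
B_rec = sr1_recursive(B0, S, Y)
B_cmp = sr1_compact(B0, S, Y)
```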


UNROLLING  THE  UPDATE 

The  reader  may  wonder  whether  limited-memory  updating  can  be  implemented 
in  simpler  ways.  In  fact,  as  we  show  here,  the  most  obvious  implementation  of  limited- 
memory  BFGS  updating  is  considerably  more  expensive  than  the  approach  based  on  compact 
representations  discussed  in  the  previous  section. 

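The direct BFGS formula (6.19) can be written as

\[ B_{k+1} = B_k - a_k a_k^T + b_k b_k^T, \tag{7.31} \]

where the vectors $a_k$ and $b_k$ are defined by

\[ a_k = \frac{B_k s_k}{(s_k^T B_k s_k)^{1/2}}, \qquad b_k = \frac{y_k}{(y_k^T s_k)^{1/2}}. \tag{7.32} \]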


We could continue to save the vector pairs $\{s_i, y_i\}$ but use the formula (7.31) to compute
matrix-vector products. A limited-memory BFGS method that uses this approach would
proceed by defining the basic matrix $B_k^0$ at each iteration and then updating according to
the formula

\[ B_k = B_k^0 + \sum_{i=k-m}^{k-1} \left[ b_i b_i^T - a_i a_i^T \right]. \tag{7.33} \]


The vector pairs $\{a_i, b_i\}$, $i = k-m, k-m+1, \dots, k-1$, would then be recovered from
the stored vector pairs $\{s_i, y_i\}$, $i = k-m, k-m+1, \dots, k-1$, by the following procedure:

Procedure 7.6 (Unrolling the BFGS formula).

  for $i = k-m, k-m+1, \dots, k-1$
      $b_i \leftarrow y_i / (y_i^T s_i)^{1/2}$;
      $a_i \leftarrow B_k^0 s_i + \sum_{j=k-m}^{i-1} \left[ (b_j^T s_i) b_j - (a_j^T s_i) a_j \right]$;
      $a_i \leftarrow a_i / (s_i^T a_i)^{1/2}$;
  end (for)

Note that the vectors $a_i$ must be recomputed at each iteration because they all depend
on the vector pair $\{s_{k-m}, y_{k-m}\}$, which is removed at the end of iteration $k$. On the other
hand, the vectors $b_i$ and the inner products $b_j^T s_i$ can be saved from the previous iteration,
so only the new values $b_{k-1}$ and $b_j^T s_{k-1}$ need to be computed at the current iteration.

By taking all these computations into account, and assuming that $B_k^0 = I$, we find
that approximately $\tfrac{3}{2} m^2 n$ operations are needed to determine the limited-memory matrix.
The actual computation of the product $B_k v$ (for arbitrary $v \in \mathbb{R}^n$) requires $4mn$
multiplications. Overall, therefore, this approach is less efficient than the one based on the
compact matrix representation described previously. Indeed, while the product $B_k v$ costs
the same in both cases, updating the representation of the limited-memory matrix by using
the compact form requires only $2mn$ multiplications, compared to $\tfrac{3}{2} m^2 n$ multiplications
needed when the BFGS formula is unrolled.
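Procedure 7.6 and the product implied by (7.33) can be sketched as follows, assuming NumPy, the basic matrix $B_k^0 = \delta I$, and pairs stored as columns of `S` and `Y` (oldest first). The names `unroll_bfgs` and `unrolled_matvec` are illustrative, and this simple version recomputes the $b_i$ as well instead of reusing them from the previous iteration. The nested loop over earlier pairs is what produces the cost proportional to $m^2 n$.

```python
import numpy as np

def unroll_bfgs(S, Y, delta=1.0):
    """Procedure 7.6: recover the vectors a_i, b_i of (7.33) from the pairs (s_i, y_i)."""
    a, b = [], []
    for s, y in zip(S.T, Y.T):
        b_i = y / np.sqrt(np.dot(y, s))
        a_i = delta * s                               # B_k^0 s_i with B_k^0 = delta * I
        for a_j, b_j in zip(a, b):                    # contributions of the earlier pairs
            a_i = a_i + np.dot(b_j, s) * b_j - np.dot(a_j, s) * a_j
        a_i = a_i / np.sqrt(np.dot(s, a_i))
        a.append(a_i)
        b.append(b_i)
    return a, b

def unrolled_matvec(v, a, b, delta=1.0):
    """Return B_k v using (7.33) with B_k^0 = delta * I."""
    out = delta * v
    for a_i, b_i in zip(a, b):
        out = out + np.dot(b_i, v) * b_i - np.dot(a_i, v) * a_i
    return out

# Synthetic data and a dense reference built by direct BFGS updates (6.19).
rng = np.random.default_rng(6)
n, m = 8, 4
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
S = rng.standard_normal((n, m))
Y = A @ S
delta = 2.0
v = rng.standard_normal(n)
B = delta * np.eye(n)
for s, y in zip(S.T, Y.T):
    Bs = B @ s
    B = B - np.outer(Bs, Bs) / np.dot(s, Bs) + np.outer(y, y) / np.dot(y, s)
a_vecs, b_vecs = unroll_bfgs(S, Y, delta)
Bv_unrolled = unrolled_matvec(v, a_vecs, b_vecs, delta)
Bv_dense = B @ v
```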


7.3  SPARSE  QUASI-NEWTON  UPDATES 


We  now  discuss  a  quasi-Newton  approach  to  large-scale  problems  that  has  intuitive  appeal: 
We  demand  that  the  quasi-Newton  approximations  Bk  have  the  same  (or  similar)  sparsity 
pattern  as  the  true  Hessian.  This  approach  would  reduce  the  storage  requirements  of  the 
algorithm  and  perhaps  give  rise  to  more  accurate  Hessian  approximations. 

Suppose that we know which components of the Hessian may be nonzero at some
point in the domain of interest. That is, we know the contents of the set $\Omega$ defined by

\[ \Omega \stackrel{\rm def}{=} \{ (i,j) \mid [\nabla^2 f(x)]_{ij} \ne 0 \text{ for some } x \text{ in the domain of } f \}. \]

Suppose also that the current Hessian approximation $B_k$ mirrors the nonzero structure of
the exact Hessian, that is, $(B_k)_{ij} = 0$ for $(i,j) \notin \Omega$. In updating $B_k$ to $B_{k+1}$, then, we
could try to find the matrix $B_{k+1}$ that satisfies the secant condition, has the same sparsity
pattern, and is as close as possible to $B_k$. Specifically, we define $B_{k+1}$ to be the solution of
the following quadratic program:

\[ \min_B \; \|B - B_k\|_F^2 = \sum_{(i,j) \in \Omega} \left[ B_{ij} - (B_k)_{ij} \right]^2, \tag{7.34a} \]

\[ \text{subject to } B s_k = y_k, \quad B = B^T, \quad \text{and } B_{ij} = 0 \text{ for } (i,j) \notin \Omega. \tag{7.34b} \]

One can show that the solution $B_{k+1}$ of this problem can be obtained by solving an $n \times n$
linear system whose sparsity pattern is $\Omega$, the same as the sparsity of the true Hessian. Once
$B_{k+1}$ has been computed, we can use it, within a trust-region method, to obtain the new
iterate $x_{k+1}$. We note that $B_{k+1}$ is not guaranteed to be positive definite.

We  omit  further  details  of  this  approach  because  it  has  several  drawbacks.  The  updating 
process does not possess scale invariance under linear transformations of the variables and,
more significantly, its practical performance has been disappointing. The fundamental
weakness  of  this  approach  is  that  (7.34a)  is  an  inadequate  model  and  can  produce  poor 
Hessian  approximations. 

An alternative approach is to relax the secant equation, making sure that it is
approximately satisfied along the last few steps rather than requiring it to hold strictly on the latest
step. To do so, we define $S_k$ and $Y_k$ by (7.28) so that they contain the $m$ most recent difference
pairs. We can then define the new Hessian approximation $B_{k+1}$ to be the solution of

\[ \min_B \; \|B S_k - Y_k\|_F^2 \quad \text{subject to } B = B^T \text{ and } B_{ij} = 0 \text{ for } (i,j) \notin \Omega. \]

This convex optimization problem has a solution, but it is not easy to compute. Moreover,
this approach can produce singular or poorly conditioned Hessian approximations. Even
though it frequently outperforms methods based on (7.34a), its performance on large
problems has not been impressive.


7.4  ALGORITHMS  FOR  PARTIALLY  SEPARABLE  FUNCTIONS 


In a separable unconstrained optimization problem, the objective function can be
decomposed into a sum of simpler functions that can be optimized independently. For example, if
we have

\[ f(x) = f_1(x_1, x_3) + f_2(x_2, x_4, x_6) + f_3(x_5), \]

we can find the optimal value of $x$ by minimizing each function $f_i$, $i = 1, 2, 3$,
independently, since no variable appears in more than one function. The cost of performing $m$
lower-dimensional optimizations is much less in general than the cost of optimizing an
$n$-dimensional function.

In many large problems the objective function $f : \mathbb{R}^n \to \mathbb{R}$ is not separable, but
it can still be written as the sum of simpler functions, known as element functions. Each
element function has the property that it is unaffected when we move along a large number
of linearly independent directions. If this property holds, we say that $f$ is partially separable.
All functions whose Hessians $\nabla^2 f$ are sparse are partially separable, but so are many
functions whose Hessian is not sparse. Partial separability allows for economical problem
representation, efficient automatic differentiation, and effective quasi-Newton updating.

The simplest form of partial separability arises when the objective function can be
written as

\[ f(x) = \sum_{i=1}^{ne} f_i(x), \tag{7.35} \]

where each of the element functions $f_i$ depends on only a few components of $x$. It follows
that the gradients $\nabla f_i$ and Hessians $\nabla^2 f_i$ of each element function contain just a few
nonzeros. By differentiating (7.35), we obtain

\[ \nabla f(x) = \sum_{i=1}^{ne} \nabla f_i(x), \qquad \nabla^2 f(x) = \sum_{i=1}^{ne} \nabla^2 f_i(x). \]

A natural question is whether it is more effective to maintain quasi-Newton approximations
to each of the element Hessians $\nabla^2 f_i(x)$ separately, rather than approximating the entire
Hessian $\nabla^2 f(x)$. We will show that the answer is affirmative, provided that the quasi-Newton
approximation fully exploits the structure of each element Hessian.

We introduce the concept by means of a simple example. Consider the objective
function

\[ f(x) = (x_1 - x_3^2)^2 + (x_2 - x_4^2)^2 + (x_3 - x_2^2)^2 + (x_4 - x_1^2)^2 \tag{7.36} \]
\[ \phantom{f(x)} = f_1(x) + f_2(x) + f_3(x) + f_4(x). \]

The Hessians of the element functions $f_i$ are $4 \times 4$ sparse, singular matrices with 4 nonzero
entries.

Let us focus on $f_1$; all other element functions have exactly the same form. Even
though $f_1$ is formally a function of all components of $x$, it depends only on $x_1$ and $x_3$, which
we call the element variables for $f_1$. We assemble the element variables into a vector that we
call $x_{[1]}$, that is,

\[ x_{[1]} = \begin{bmatrix} x_1 \\ x_3 \end{bmatrix}, \]

and note that

\[ x_{[1]} = U_1 x \quad \text{with} \quad U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \]

If we define the function $\phi_1$ by

\[ \phi_1(z_1, z_2) = (z_1 - z_2^2)^2, \]

then we can write $f_1(x) = \phi_1(U_1 x)$. By applying the chain rule to this representation, we
obtain

\[ \nabla f_1(x) = U_1^T \nabla \phi_1(U_1 x), \qquad \nabla^2 f_1(x) = U_1^T \nabla^2 \phi_1(U_1 x) U_1. \tag{7.37} \]

In our case, we have

\[ \nabla^2 \phi_1(U_1 x) = \begin{bmatrix} 2 & -4x_3 \\ -4x_3 & 12x_3^2 - 4x_1 \end{bmatrix}, \qquad
   \nabla^2 f_1(x) = \begin{bmatrix} 2 & 0 & -4x_3 & 0 \\ 0 & 0 & 0 & 0 \\ -4x_3 & 0 & 12x_3^2 - 4x_1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \]

The matrix $U_1$, known as a compactifying matrix, allows us to map the derivative information
for the low-dimensional function $\phi_1$ into the derivative information for the element function
$f_1$.
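The mapping (7.37) can be made concrete for the first element function. The sketch below assumes NumPy and the element function $\phi_1(z_1, z_2) = (z_1 - z_2^2)^2$ from the example; it forms the $2 \times 2$ Hessian of $\phi_1$ analytically and uses $U_1$ to place its entries in the $4 \times 4$ element Hessian. The function names and the test point are illustrative.

```python
import numpy as np

# Compactifying matrix for f_1: selects the element variables (x_1, x_3).
U1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])

def hess_phi1(z):
    """Hessian of phi_1(z1, z2) = (z1 - z2**2)**2."""
    z1, z2 = z
    return np.array([[2.0, -4.0 * z2],
                     [-4.0 * z2, 12.0 * z2**2 - 4.0 * z1]])

def hess_f1(x):
    """Element Hessian of f_1 via (7.37): U_1^T (Hessian of phi_1) U_1."""
    return U1.T @ hess_phi1(U1 @ x) @ U1

x = np.array([1.0, 2.0, 3.0, 4.0])
H1 = hess_f1(x)          # nonzeros appear only in rows and columns 1 and 3 (1-based)
```

In practice one would store the index set $\{1, 3\}$ rather than the matrix $U_1$ itself, since multiplying by $U_1$ only gathers components and multiplying by $U_1^T$ only scatters them.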

Now comes the key idea: Instead of maintaining a quasi-Newton approximation to
$\nabla^2 f_1$, we maintain a $2 \times 2$ quasi-Newton approximation $B_{[1]}$ of $\nabla^2 \phi_1$ and use the relation
(7.37) to transform it into a quasi-Newton approximation to $\nabla^2 f_1$. To update $B_{[1]}$ after a
typical step from $x$ to $x^+$, we record the information

\[ s_{[1]} = x_{[1]}^+ - x_{[1]}, \qquad y_{[1]} = \nabla \phi_1(x_{[1]}^+) - \nabla \phi_1(x_{[1]}), \tag{7.38} \]

and use BFGS or SR1 updating to obtain the new approximation $B_{[1]}^+$. We therefore update
small, dense quasi-Newton approximations with the property

\[ B_{[1]} \approx \nabla^2 \phi_1(U_1 x) = \nabla^2 \phi_1(x_{[1]}). \tag{7.39} \]

To obtain an approximation of the element Hessian $\nabla^2 f_1$, we use the transformation
suggested by the relationship (7.37); that is,

\[ \nabla^2 f_1(x) \approx U_1^T B_{[1]} U_1. \]

This operation has the effect of mapping the elements of $B_{[1]}$ to the correct positions in the
full $n \times n$ Hessian approximation.

The previous discussion concerned only the first element function $f_1$, but we can treat
all other functions $f_i$ in the same way. The full objective function can now be written as

\[ f(x) = \sum_{i=1}^{ne} \phi_i(U_i x), \tag{7.40} \]

and we maintain a quasi-Newton approximation $B_{[i]}$ for each of the functions $\phi_i$. To obtain
a complete approximation to the full Hessian $\nabla^2 f$, we simply sum the element Hessian
approximations as follows:

\[ B = \sum_{i=1}^{ne} U_i^T B_{[i]} U_i. \tag{7.41} \]

We may use this approximate Hessian in a trust-region algorithm, obtaining an
approximate solution $p_k$ of the system

\[ B_k p_k = -\nabla f_k. \tag{7.42} \]

We need not assemble $B_k$ explicitly but rather use the conjugate gradient approach to solve
(7.42), computing matrix-vector products of the form $B_k v$ by performing operations with
the matrices $U_i$ and $B_{[i]}$.
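A matrix-vector product with the matrix in (7.41) needs only the small element matrices and the lists of element variables. The following sketch assumes NumPy and represents each $U_i$ by an index list (a hypothetical data layout, not one prescribed by the text); the random symmetric $2 \times 2$ matrices stand in for the approximations $B_{[i]}$, and the index sets follow the structure of (7.36) with zero-based indexing.

```python
import numpy as np

def element_matvec(v, elements):
    """Return B v for B = sum_i U_i^T B_[i] U_i, as in (7.41), without assembling B.

    `elements` is a list of (idx, B_el): idx holds the distinct indices of the
    element variables (the role of U_i) and B_el is the small dense matrix B_[i].
    """
    out = np.zeros(len(v))
    for idx, B_el in elements:
        out[idx] += B_el @ v[idx]        # gather U_i v, multiply, scatter back with U_i^T
    return out

# Element structure of (7.36): f_1 uses (x1, x3), f_2 (x2, x4), f_3 (x3, x2), f_4 (x4, x1).
rng = np.random.default_rng(5)
index_sets = [[0, 2], [1, 3], [2, 1], [3, 0]]
elements = []
for idx in index_sets:
    C = rng.standard_normal((2, 2))
    elements.append((idx, C + C.T))      # symmetric stand-in for B_[i]
v = rng.standard_normal(4)
# Dense reference: np.eye(4)[idx] is the compactifying matrix U_i.
B_dense = sum(np.eye(4)[idx].T @ B_el @ np.eye(4)[idx] for idx, B_el in elements)
Bv = element_matvec(v, elements)
```

Each product costs a number of operations proportional to the sum of the squared element dimensions, independent of $n$, which is what makes a conjugate gradient solve of (7.42) affordable.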

To illustrate the usefulness of this element-by-element updating technique, let us
consider a problem of the form (7.36) but this time involving 1000 variables, not just 4. The
functions $\phi_i$ still depend on only two internal variables, so that each Hessian approximation
$B_{[i]}$ is a $2 \times 2$ matrix. After just a few iterations, we will have sampled enough directions
$s_{[i]}$ to make each $B_{[i]}$ an accurate approximation to $\nabla^2 \phi_i$. Hence the full quasi-Newton
approximation (7.41) will tend to be a very good approximation to $\nabla^2 f(x)$. By contrast, a
quasi-Newton method that ignores the partially separable structure of the objective function
will attempt to estimate the total average curvature (the sum of the individual curvatures
of the element functions) by approximating the $1000 \times 1000$ Hessian matrix. When the
number of variables $n$ is large, many iterations will be required before this quasi-Newton
approximation is of good quality. Hence an algorithm of this type (for example, standard
BFGS or L-BFGS) will require many more iterations than a method based on the partially
separable approximate Hessian.

It is not always possible to use the BFGS formula to update the partial Hessian $B_{[i]}$,
because there is no guarantee that the curvature condition $s_{[i]}^T y_{[i]} > 0$ will be satisfied. That
is, even though the full Hessian $\nabla^2 f(x)$ is at least positive semidefinite at the solution $x^*$,
some of the individual Hessians $\nabla^2 \phi_i(\cdot)$ may be indefinite. One way to overcome this obstacle
is to apply the SR1 update to each of the element Hessians. This approach has proved effective
in the Lancelot package [72], which is designed to take full advantage of partial separability.

The main limitations of this quasi-Newton approach are the cost of the step
computation (7.42), which is comparable to the cost of a Newton step, and the difficulty of identifying
the partially separable structure of a function. The performance of quasi-Newton methods
is satisfactory provided that we find the finest partially separable decomposition of the
problem; see [72]. Furthermore, even when the partially separable structure is known, it
may be more efficient to compute a Newton step. For example, the modeling language ampl
automatically detects the partially separable structure of a function $f$ and uses it to compute
the Hessian $\nabla^2 f(x)$.


7.5  PERSPECTIVES  AND  SOFTWARE 


Newton-CG methods have been used successfully to solve large problems in a variety
of applications. Many of these implementations are developed by engineers and
scientists and use problem-specific preconditioners. Freely available packages include
tn/tnbc [220] and tnpack [275]. Software for more general problems, such as Lancelot
[72], knitro/cg [50], and tron [192], employs Newton-CG methods when applied to
unconstrained problems. Other packages, such as loqo [294], implement Newton methods
with a sparse factorization modified to ensure positive definiteness. Gltr [145] offers
a Newton-Lanczos method. There is insufficient experience to date to say whether the
Newton-Lanczos method is significantly better in practice than the Steihaug strategy given in
Algorithm 7.2.

Software  for  computing  incomplete  Cholesky  preconditioners  includes  the  icfs  [193] 
and  ma57  [166]  packages.  A  preconditioner  for  Newton-CG  based  on  limited-memory 
BFGS  approximations  is  provided  in  preqn  [209]. 

Limited-memory  BFGS  methods  are  implemented  in  lbfgs  [194]  and  m1qn3  [122]; 
see  Gill  and  Leonard  [125]  for  a  variant  that  requires  less  storage  and  appears  to  be  quite 
efficient.  The  compact  limited-memory  representations  of  Section  7.2  are  used  in  lbfgs-b 
[322],  ipopt  [301],  and  knitro. 

The Lancelot package exploits partial separability. It provides SR1 and BFGS quasi-Newton
options as well as a Newton method. The step computation is obtained by a
preconditioned conjugate gradient iteration using trust regions. If $f$ is partially separable, an
arbitrary affine transformation will not in general preserve the partially separable structure.
The quasi-Newton method for partially separable functions described in Section 7.4 is not
invariant to affine transformations of the variables, but this is not a drawback because the
method is invariant under transformations that preserve separability.

EXERCISES

7.1 Code Algorithm 7.5, and test it on the extended Rosenbrock function

$$f(x) = \sum_{i=1}^{n/2} \left[ \alpha \left( x_{2i} - x_{2i-1}^2 \right)^2 + \left( 1 - x_{2i-1} \right)^2 \right],$$

where $\alpha$ is a parameter that you can vary (for example, 1 or 100). The solution is
$x^* = (1, 1, \ldots, 1)^T$, $f^* = 0$. Choose the starting point as $(-1, -1, \ldots, -1)^T$. Observe the
behavior of your program for various values of the memory parameter $m$.

7.2 Show that the matrix $H_{k+1}$ in (7.21) is singular.

7.3 Derive the formula (7.23) under the assumption that line searches are exact.

7.4 Consider limited-memory SR1 updating based on (7.30). Explain how the storage
can be cut in half if the basic matrix $B_0$ is kept fixed for all $k$. (Hint: Consider the matrix
$Q_k = [q_0, \ldots, q_{k-1}] = Y_k - B_0 S_k$.)

7.5 Write the function defined by

$$f(x) = x_2 x_3 e^{x_1 + x_3 - x_4} + (x_2 x_3)^2 + (x_3 - x_4)$$

in the form (7.40). In particular, give the definition of each of the compactifying
transformations $U_j$.

7.6 Does the approximation $B$ obtained by the partially separable quasi-Newton
updating (7.38), (7.41) satisfy the secant equation $Bs = y$?

7.7 The minimum surface problem is a classical application of the calculus of variations
and can be found in many textbooks. We wish to find the surface of minimum
area, defined on the unit square, that interpolates a prescribed continuous function on the
boundary of the square. In the standard discretization of this problem, the unknowns are
the values of the sought-after function $z(x, y)$ on a $q \times q$ rectangular mesh of points over
the unit square.

More specifically, we divide each edge of the square into $q$ intervals of equal length,
yielding $(q+1)^2$ grid points. We label the grid points as

$$x_{(i-1)(q+1)+1}, \ldots, x_{i(q+1)} \quad \text{for } i = 1, 2, \ldots, q+1,$$

so that each value of $i$ generates a line. With each point we associate a variable $z_i$ that
represents the height of the surface at this point. For the $4q$ grid points on the boundary
of the unit square, the values of these variables are determined by the given function. The
optimization problem is to determine the other $(q+1)^2 - 4q$ variables $z_i$ so that the total
surface area is minimized.

A typical subsquare in this partition looks as follows:

$$\begin{array}{cc} x_{j+q+1} & x_{j+q+2} \\ x_j & x_{j+1} \end{array}$$

We denote this square by $A_j$ and note that its area is $1/q^2$. The desired function is $z(x, y)$, and
we wish to compute its surface over $A_j$. Calculus books show that the area of the surface is
given by

$$f_j(x) \equiv \iint_{(x,y) \in A_j} \sqrt{1 + \left( \frac{\partial z}{\partial x} \right)^2 + \left( \frac{\partial z}{\partial y} \right)^2} \, dx \, dy.$$

Approximate the derivatives by finite differences, and show that $f_j$ has the form

$$f_j(x) = \frac{1}{q^2} \left[ 1 + \frac{q^2}{2} \left[ (x_j - x_{j+q+1})^2 + (x_{j+1} - x_{j+q})^2 \right] \right]^{1/2}. \tag{7.43}$$

7.8 Compute the gradient of the element function (7.43) with respect to the full
vector $x$. Show that it contains at most four nonzeros, and that two of these four nonzero
components are negatives of the other two. Compute the Hessian of $f_j$, and show that,
among the 16 nonzeros, only three different magnitudes are represented. Also show that
this Hessian is singular.


Chapter 8

Calculating Derivatives


Most  algorithms  for  nonlinear  optimization  and  nonlinear  equations  require  knowledge  of 
derivatives.  Sometimes  the  derivatives  are  easy  to  calculate  by  hand,  and  it  is  reasonable 
to  expect  the  user  to  provide  code  to  compute  them.  In  other  cases,  the  functions  are  too 
complicated,  so  we  look  for  ways  to  calculate  or  approximate  the  derivatives  automatically. 
A  number  of  interesting  approaches  are  available,  of  which  the  most  important  are  probably 
the  following. 

Finite Differencing. This technique has its roots in Taylor’s theorem (see Chapter 2). By
observing the change in function values in response to small perturbations of the unknowns
near a given point $x$, we can estimate the response to infinitesimal perturbations, that is,
the derivatives. For instance, the partial derivative of a smooth function $f : \mathbb{R}^n \to \mathbb{R}$ with
respect to the $i$th variable $x_i$ can be approximated by the central-difference formula

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon},$$

where $\epsilon$ is a small positive scalar and $e_i$ is the $i$th unit vector, that is, the vector whose
elements are all 0 except for a 1 in the $i$th position.

Automatic  Differentiation.  This  technique  takes  the  view  that  the  computer  code  for 
evaluating  the  function  can  be  broken  down  into  a  composition  of  elementary  arithmetic 
operations,  to  which  the  chain  rule  (one  of  the  basic  rules  of  calculus)  can  be  applied.  Some 
software  tools  for  automatic  differentiation  (such  as  ADIFOR  [25])  produce  new  code  that 
calculates both function and derivative values. Other tools (such as ADOL-C [154]) keep a
record  of  the  elementary  computations  that  take  place  while  the  function  evaluation  code 
for  a  given  point  x  is  executing  on  the  computer.  This  information  is  processed  to  produce 
the  derivatives  at  the  same  point  x. 

Symbolic Differentiation. In this technique, the algebraic specification for the function $f$ is
manipulated  by  symbolic  manipulation  tools  to  produce  new  algebraic  expressions  for  each 
component  of  the  gradient.  Commonly  used  symbolic  manipulation  tools  can  be  found  in 
the  packages  Mathematica  [311],  Maple  [304],  and  Macsyma  [197]. 

In  this  chapter  we  discuss  the  first  two  approaches:  finite  differencing  and  automatic 
differentiation. 

The  usefulness  of  derivatives  is  not  restricted  to  algorithms  for  optimization.  Modelers 
in  areas  such  as  design  optimization  and  economics  are  often  interested  in  performing 
post-optimal sensitivity analysis, in which they determine the sensitivity of the optimum to
small  perturbations  in  the  parameter  or  constraint  values.  Derivatives  are  also  important  in 
other  areas  such  as  nonlinear  differential  equations  and  simulation. 


8.1  FINITE-DIFFERENCE  DERIVATIVE  APPROXIMATIONS 


Finite  differencing  is  an  approach  to  the  calculation  of  approximate  derivatives  whose 
motivation  (like  that  of  so  many  algorithms  in  optimization)  comes  from  Taylor’s  theorem. 
Many  software  packages  perform  automatic  calculation  of  finite  differences  whenever  the 
user  is  unable  or  unwilling  to  supply  code  to  calculate  exact  derivatives.  Although  they  yield 
only  approximate  values  for  the  derivatives,  the  results  are  adequate  in  many  situations. 

By  definition,  derivatives  are  a  measure  of  the  sensitivity  of  the  function  to  infinitesimal 
changes  in  the  values  of  the  variables.  Our  approach  in  this  section  is  to  make  small,  finite 
perturbations in the values of x and examine the resulting differences in the function values.
By taking ratios of the function difference to variable difference, we obtain approximations
to the derivatives.


APPROXIMATING  THE  GRADIENT 

An approximation to the gradient vector $\nabla f(x)$ can be obtained by evaluating the
function $f$ at $(n + 1)$ points and performing some elementary arithmetic. We describe this
technique, along with a more accurate variant that requires additional function evaluations.

A popular formula for approximating the partial derivative $\partial f / \partial x_i$ at a given point $x$
is the forward-difference, or one-sided-difference, approximation, defined as

$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}. \tag{8.1}$$

The gradient can be built up by simply applying this formula for $i = 1, 2, \ldots, n$. This
process requires evaluation of $f$ at the point $x$ as well as the $n$ perturbed points $x + \epsilon e_i$,
$i = 1, 2, \ldots, n$: a total of $(n + 1)$ points.
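The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration: the test function `f_example`, the evaluation counter, and the default step `eps=1e-8` are assumptions made for the example and are not taken from the text.

```python
def forward_difference_gradient(f, x, eps=1e-8):
    """Approximate the gradient of f at x with the forward-difference formula (8.1).

    eps=1e-8 is an illustrative default (an assumption), roughly the square
    root of double-precision unit roundoff.
    """
    fx = f(x)                          # one evaluation at the base point x
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps                   # the perturbed point x + eps * e_i
        grad.append((f(xp) - fx) / eps)
    return grad


# Hypothetical test function, chosen only for illustration.
counter = {"calls": 0}

def f_example(x):
    counter["calls"] += 1
    return x[0] ** 2 + 3.0 * x[0] * x[1] + x[1] ** 3


g = forward_difference_gradient(f_example, [1.0, 2.0])
print(g)                  # close to the exact gradient [8.0, 15.0]
print(counter["calls"])   # n + 1 = 3 function evaluations
```

The base value `f(x)` is computed once and reused for every component, which is why the total cost is n + 1 evaluations rather than 2n.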

The basis for the formula (8.1) is Taylor’s theorem, Theorem 2.1 in Chapter 2. When
$f$ is twice continuously differentiable, we have

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp) p, \quad \text{some } t \in (0, 1) \tag{8.2}$$

(see (2.6)). If we choose $L$ to be a bound on the size of $\|\nabla^2 f(\cdot)\|$ in the region of interest,
it follows directly from this formula that the last term in this expression is bounded by
$(L/2)\|p\|^2$, so that

$$\left| f(x + p) - f(x) - \nabla f(x)^T p \right| \le (L/2) \|p\|^2. \tag{8.3}$$

We now choose the vector $p$ to be $\epsilon e_i$, so that it represents a small change in the value
of a single component of $x$ (the $i$th component). For this $p$, we have that
$\nabla f(x)^T p = \epsilon \nabla f(x)^T e_i = \epsilon \, \partial f / \partial x_i$, so by rearranging (8.3), we conclude that

$$\frac{\partial f}{\partial x_i}(x) = \frac{f(x + \epsilon e_i) - f(x)}{\epsilon} + \delta_\epsilon, \quad \text{where } |\delta_\epsilon| \le (L/2)\epsilon. \tag{8.4}$$

We derive the forward-difference formula (8.1) by simply ignoring the error term $\delta_\epsilon$ in this
expression, which becomes smaller and smaller as $\epsilon$ approaches zero.

An important issue in implementing the formula (8.1) is the choice of the parameter
$\epsilon$. The error expression (8.4) suggests that we should choose $\epsilon$ as small as possible.
Unfortunately, this expression ignores the roundoff errors that are introduced when the function
$f$ is evaluated on a real computer, in floating-point arithmetic. From our discussion in the
Appendix (see (A.30) and (A.31)), we know that the quantity $u$ known as unit roundoff
is crucial: It is a bound on the relative error that is introduced whenever an arithmetic
operation is performed on two floating-point numbers. ($u$ is about $1.1 \times 10^{-16}$ in
double-precision IEEE floating-point arithmetic.) The effect of these errors on the final computed
value of $f$ depends on the way in which $f$ is computed. It could come from an arithmetic
formula, or from a differential equation solver, with or without refinement.

As a rough estimate, let us assume simply that the relative error in the computed $f$ is
bounded by $u$, so that the computed values of $f(x)$ and $f(x + \epsilon e_i)$ are related to the exact
values in the following way:

$$|\mathrm{comp}(f(x)) - f(x)| \le u L_f,$$

$$|\mathrm{comp}(f(x + \epsilon e_i)) - f(x + \epsilon e_i)| \le u L_f,$$

where $\mathrm{comp}(\cdot)$ denotes the computed value, and $L_f$ is a bound on the value of $|f(\cdot)|$ in the
region of interest. If we use these computed values of $f$ in place of the exact values in (8.4)
and (8.1), we obtain an error that is bounded by

$$(L/2)\epsilon + 2 u L_f / \epsilon. \tag{8.5}$$

Naturally, we would like to choose $\epsilon$ to make this error as small as possible. Setting the
derivative of (8.5) with respect to $\epsilon$ to zero shows that the minimizing value satisfies

$$\epsilon^2 = \frac{4 L_f u}{L}.$$

If we assume that the problem is well scaled, then the ratio $L_f / L$ (the ratio of function
values to second derivative values) does not exceed a modest size. We can conclude that the
following choice of $\epsilon$ is fairly close to optimal:

$$\epsilon = \sqrt{u}. \tag{8.6}$$

(In fact, this value is used in many of the optimization software packages that use finite
differencing as an option for estimating derivatives.) For this value of $\epsilon$, we have from (8.5)
that the total error in the forward-difference approximation is fairly close to $\sqrt{u}$.
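The trade-off between truncation error and roundoff error in (8.5) is easy to observe numerically. The following sketch compares three step sizes on a hypothetical scalar test problem (the exponential function at the point 1, chosen only for illustration); the helper names are assumptions of this example.

```python
import math

U = 2.0 ** -53          # unit roundoff for IEEE double precision, about 1.1e-16


def forward_difference(f, x, eps):
    """Scalar version of the forward-difference formula (8.1)."""
    return (f(x + eps) - f(x)) / eps


def abs_error(eps):
    # Hypothetical test problem: f = exp at x = 1, whose exact derivative is e.
    return abs(forward_difference(math.exp, 1.0, eps) - math.e)


# A step that is too large, the step sqrt(u) from (8.6), and a step that is too small.
for eps in (1e-4, math.sqrt(U), 1e-12):
    print(f"eps = {eps:.2e}   error = {abs_error(eps):.2e}")
```

On this example the large step is dominated by the truncation term $(L/2)\epsilon$ and the tiny step by the roundoff term $2 u L_f / \epsilon$, while the step $\sqrt{u}$ gives the smallest error of the three.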

A more accurate approximation to the derivative can be obtained by using the
central-difference formula, defined as

$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}. \tag{8.7}$$

As we show below, this approximation is more accurate than the forward-difference
approximation (8.1). It is also about twice as expensive, since we need to evaluate $f$ at the points
$x$ and $x \pm \epsilon e_i$, $i = 1, 2, \ldots, n$: a total of $2n + 1$ points.




The basis for the central difference approximation is again Taylor’s theorem. When
the second derivatives of $f$ exist and are Lipschitz continuous, we have from (8.2) that

$$\begin{aligned} f(x + p) &= f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp) p \quad \text{for some } t \in (0, 1) \\ &= f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x) p + O(\|p\|^3). \end{aligned} \tag{8.8}$$

By setting $p = \epsilon e_i$ and $p = -\epsilon e_i$, respectively, we obtain

$$f(x + \epsilon e_i) = f(x) + \epsilon \frac{\partial f}{\partial x_i} + \frac{1}{2} \epsilon^2 \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3),$$

$$f(x - \epsilon e_i) = f(x) - \epsilon \frac{\partial f}{\partial x_i} + \frac{1}{2} \epsilon^2 \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3).$$

(Note that the final error terms in these two expressions are generally not the same, but they
are both bounded by some multiple of $\epsilon^3$.) By subtracting the second equation from the
first and dividing by $2\epsilon$, we obtain the expression

$$\frac{\partial f}{\partial x_i}(x) = \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon} + O(\epsilon^2).$$

We see from this expression that the error is $O(\epsilon^2)$, as compared to the $O(\epsilon)$ error in the
forward-difference formula (8.1). However, when we take evaluation error in $f$ into account,
the accuracy that can be achieved in practice is less impressive; the same assumptions that
were used to derive (8.6) lead to an optimal choice of $\epsilon$ of about $u^{1/3}$ and an error of about
$u^{2/3}$. In some situations, the extra few digits of accuracy may improve the performance of
the algorithm enough to make the extra expense worthwhile.
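A central-difference gradient can be sketched in the same style as the forward-difference one. In this illustration the default step is taken to be about $u^{1/3}$, following the estimate above; the test function `f_example` is a hypothetical choice made only for the example.

```python
def central_difference_gradient(f, x, eps=None):
    """Approximate the gradient of f at x with the central-difference formula (8.7)."""
    if eps is None:
        # About u**(1/3), where u = 2**-53 is double-precision unit roundoff.
        eps = (2.0 ** -53) ** (1.0 / 3.0)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps                   # x + eps * e_i
        xm[i] -= eps                   # x - eps * e_i
        grad.append((f(xp) - f(xm)) / (2.0 * eps))
    return grad


def f_example(x):
    # Hypothetical test function; its exact gradient at (1, 2) is (8, 15).
    return x[0] ** 2 + 3.0 * x[0] * x[1] + x[1] ** 3


g = central_difference_gradient(f_example, [1.0, 2.0])
print(g)   # close to [8.0, 15.0]
```

Each component costs two function evaluations, so the loop uses 2n evaluations in total; the value at the base point $x$ is not needed by the formula itself.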


APPROXIMATING  A  SPARSE  JACOBIAN 

Consider now the case of a vector function $r : \mathbb{R}^n \to \mathbb{R}^m$, such as the residual vector
that we consider in Chapter 10 or the system of nonlinear equations from Chapter 11. The
matrix $J(x)$ of first derivatives for this function is defined as follows:

$$J(x) = \left[ \frac{\partial r_j}{\partial x_i} \right]_{\substack{j = 1, 2, \ldots, m \\ i = 1, 2, \ldots, n}} = \begin{bmatrix} \nabla r_1(x)^T \\ \nabla r_2(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix}, \tag{8.9}$$

where $r_j$, $j = 1, 2, \ldots, m$ are the components of $r$. The techniques described in the previous
section can be used to evaluate the full Jacobian $J(x)$ one column at a time. When $r$ is twice
continuously differentiable, we can use Taylor’s theorem to deduce that

$$\| r(x + p) - r(x) - J(x) p \| \le (L/2) \|p\|^2, \tag{8.10}$$


where $L$ is a Lipschitz constant for $J$ in the region of interest. If we require an approximation
to the Jacobian-vector product $J(x)p$ for a given vector $p$ (as is the case with inexact Newton
methods for nonlinear systems of equations; see Section 11.1), this expression immediately
suggests choosing a small nonzero $\epsilon$ and setting

$$J(x) p \approx \frac{r(x + \epsilon p) - r(x)}{\epsilon}, \tag{8.11}$$

an approximation that is accurate to $O(\epsilon)$. A two-sided approximation can be derived from
the formula (8.7).
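The Jacobian-vector approximation (8.11) needs only one extra evaluation of $r$, regardless of the dimension. A minimal sketch follows; the function `r_example` and the default step `eps=1e-7` are assumptions made for illustration.

```python
import math


def jacobian_vector_product(r, x, p, eps=1e-7):
    """Approximate J(x) p by the one-sided difference (8.11).

    eps=1e-7 is an illustrative default, not a recommendation from the text.
    """
    rx = r(x)
    x_shifted = [xi + eps * pi for xi, pi in zip(x, p)]   # the point x + eps * p
    r_shifted = r(x_shifted)
    return [(a - b) / eps for a, b in zip(r_shifted, rx)]


def r_example(x):
    # Hypothetical vector function with Jacobian
    # [[2*x0, 1], [cos(x0)*x1, sin(x0)]].
    return [x[0] ** 2 + x[1], math.sin(x[0]) * x[1]]


jp = jacobian_vector_product(r_example, [1.0, 2.0], [0.5, -1.0])
print(jp)   # close to [0.0, cos(1) - sin(1)]
```

No Jacobian matrix is formed or stored, which is the reason this approximation is attractive inside iterative methods that only need products with $J(x)$.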

If an approximation to the full Jacobian $J(x)$ is required, we can compute it a column
at a time, analogously to (8.1), by setting $p = \epsilon e_i$ in (8.10) to derive the following
estimate of the $i$th column:

$$\frac{\partial r}{\partial x_i}(x) \approx \frac{r(x + \epsilon e_i) - r(x)}{\epsilon}. \tag{8.12}$$


A  full  Jacobian  estimate  can  be  obtained  at  a  cost  of  n  +  1  evaluations  of  the  function  r. 
When  the  Jacobian  is  sparse,  however,  we  can  often  obtain  the  estimate  at  a  much  lower  cost, 
sometimes  just  three  or  four  evaluations  of  r.  The  key  is  to  estimate  a  number  of  different 
columns  of  the  Jacobian  simultaneously,  by  judicious  choices  of  the  perturbation  vector  p 
in  (8.10). 

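We illustrate the technique with a simple example. Consider the function $r : \mathbb{R}^n \to \mathbb{R}^n$
defined by

$$r(x) = \begin{bmatrix} 2(x_2^3 - x_1^2) \\ 3(x_2^3 - x_1^2) + 2(x_3^3 - x_2^2) \\ 3(x_3^3 - x_2^2) + 2(x_4^3 - x_3^2) \\ \vdots \\ 3(x_n^3 - x_{n-1}^2) \end{bmatrix}. \tag{8.13}$$

Each component of $r$ depends on just two or three components of $x$, so that each row of the
Jacobian contains only two or three nonzero elements. For the case of $n = 6$, the Jacobian
has the following structure:

$$\begin{bmatrix} \times & \times & & & & \\ \times & \times & \times & & & \\ & \times & \times & \times & & \\ & & \times & \times & \times & \\ & & & \times & \times & \times \\ & & & & \times & \times \end{bmatrix}, \tag{8.14}$$

where each cross represents a nonzero element, with zeros represented by a blank space.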

Staying for the moment with the case $n = 6$, suppose that we wish to compute a
finite-difference approximation to the Jacobian. (Of course, it is easy to calculate this particular
Jacobian by hand, but there are complicated functions with similar structure for which
hand calculation is more difficult.) A perturbation $p = \epsilon e_1$ to the first component of $x$
will affect only the first and second components of $r$. The remaining components will be
unchanged, so that the right-hand side of formula (8.12) will correctly evaluate to zero in
the components 3, 4, 5, 6. It is wasteful, however, to reevaluate these components of $r$ when
we know in advance that their values are not affected by the perturbation. Instead, we look
for a way to modify the perturbation vector so that it does not have any further effect on
components 1 and 2, but does produce a change in some of the components 3, 4, 5, 6, which
we can then use as the basis of a finite-difference estimate for some other column of the
Jacobian. It is not hard to see that the additional perturbation $\epsilon e_4$ has the desired property:
It alters the 3rd, 4th, and 5th elements of $r$, but leaves the 1st and 2nd elements unchanged.
The changes in $r$ as a result of the perturbations $\epsilon e_1$ and $\epsilon e_4$ do not interfere with each
other.

To express this discussion in mathematical terms, we set

$$p = \epsilon(e_1 + e_4),$$

and note that

$$r(x + p)_{1,2} = r(x + \epsilon(e_1 + e_4))_{1,2} = r(x + \epsilon e_1)_{1,2} \tag{8.15}$$

(where the notation $[\cdot]_{1,2}$ denotes the subvector consisting of the first and second elements),
while

$$r(x + p)_{3,4,5} = r(x + \epsilon(e_1 + e_4))_{3,4,5} = r(x + \epsilon e_4)_{3,4,5}. \tag{8.16}$$

By substituting (8.15) into (8.10), we obtain

$$r(x + p)_{1,2} = r(x)_{1,2} + \epsilon [J(x) e_1]_{1,2} + O(\epsilon^2).$$

By rearranging this expression, we obtain the following difference formula for estimating
the (1, 1) and (2, 1) elements of the Jacobian matrix:

$$\begin{bmatrix} \dfrac{\partial r_1}{\partial x_1}(x) \\[2ex] \dfrac{\partial r_2}{\partial x_1}(x) \end{bmatrix} = [J(x) e_1]_{1,2} \approx \frac{r(x + p)_{1,2} - r(x)_{1,2}}{\epsilon}. \tag{8.17}$$

A similar argument shows that the nonzero elements of the fourth column of the Jacobian
can be estimated by substituting (8.16) into (8.10); we obtain

$$\begin{bmatrix} \dfrac{\partial r_3}{\partial x_4}(x) \\[2ex] \dfrac{\partial r_4}{\partial x_4}(x) \\[2ex] \dfrac{\partial r_5}{\partial x_4}(x) \end{bmatrix} = [J(x) e_4]_{3,4,5} \approx \frac{r(x + p)_{3,4,5} - r(x)_{3,4,5}}{\epsilon}. \tag{8.18}$$


To summarize: We have been able to estimate two columns of the Jacobian $J(x)$ by evaluating
the function $r$ at the single extra point $x + \epsilon(e_1 + e_4)$.

We can approximate the remainder of $J(x)$ in an economical manner as well. Columns
2 and 5 can be approximated by choosing $p = \epsilon(e_2 + e_5)$, while we can use $p = \epsilon(e_3 + e_6)$
to approximate columns 3 and 6. In total, we need 3 evaluations of the function $r$ (after the
initial evaluation at $x$) to estimate the entire Jacobian matrix.

In fact, for any choice of $n$ in (8.13) (no matter how large), three extra evaluations of $r$
are sufficient to approximate the entire Jacobian. The corresponding choices of perturbation
vectors $p$ are

$$p = \epsilon(e_1 + e_4 + e_7 + e_{10} + \cdots),$$

$$p = \epsilon(e_2 + e_5 + e_8 + e_{11} + \cdots),$$

$$p = \epsilon(e_3 + e_6 + e_9 + e_{12} + \cdots).$$

In the first of these vectors, the nonzero components are chosen so that no two of the
columns 1, 4, 7, . . . have a nonzero element in the same row. The same property holds for
the other two vectors and, in fact, points the way to the criterion that we can apply to general
problems to decide on a valid set of perturbation vectors.
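The three-perturbation scheme can be sketched as follows for a tridiagonal Jacobian. The function `r` below is written in the form of (8.13), but its exact exponents should be treated as an assumption of this example; the step `eps=1e-7` and all names are likewise illustrative. Indices are 0-based in the code, so the groups are columns {0, 3, ...}, {1, 4, ...}, and {2, 5, ...}.

```python
def r(x):
    """Tridiagonal test function in the form of (8.13) (exponents assumed)."""
    n = len(x)
    out = [0.0] * n
    out[0] = 2.0 * (x[1] ** 3 - x[0] ** 2)
    for j in range(1, n - 1):
        out[j] = 3.0 * (x[j] ** 3 - x[j - 1] ** 2) + 2.0 * (x[j + 1] ** 3 - x[j] ** 2)
    out[n - 1] = 3.0 * (x[n - 1] ** 3 - x[n - 2] ** 2)
    return out


def tridiagonal_fd_jacobian(r, x, eps=1e-7):
    """Estimate a tridiagonal Jacobian using three grouped perturbations."""
    n = len(x)
    rx = r(x)                                  # initial evaluation at x
    J = [[0.0] * n for _ in range(n)]
    for group in range(3):
        cols = range(group, n, 3)              # columns group, group + 3, ...
        xp = list(x)
        for i in cols:
            xp[i] += eps                       # p = eps * (sum of e_i over the group)
        diff = [(a - b) / eps for a, b in zip(r(xp), rx)]
        for i in cols:
            # Column i has nonzeros only in rows i - 1, i, i + 1.
            for j in range(max(0, i - 1), min(n, i + 2)):
                J[j][i] = diff[j]
    return J


x0 = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
J = tridiagonal_fd_jacobian(r, x0)
print(J[0][:2])   # estimates of the (1,1) and (1,2) elements of the Jacobian
```

Whatever the value of n, the loop over `group` runs three times, so the cost is three evaluations of `r` beyond the initial one.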

Algorithms for choosing the perturbation vectors can be expressed conveniently in the
language of graphs and graph coloring. For any function $r : \mathbb{R}^n \to \mathbb{R}^m$, we can construct
a column incidence graph $G$ (also referred to below as the intersection graph) with $n$ nodes
by drawing an arc between nodes $i$ and $k$ if
there is some component of $r$ that depends on both $x_i$ and $x_k$. In other words, the $i$th and
$k$th columns of the Jacobian $J(x)$ each have a nonzero element in some row $j$, for some
$j = 1, 2, \ldots, m$ and some value of $x$. (The intersection graph for the function defined in
(8.13), with $n = 6$, is shown in Figure 8.1.) We now assign each node a “color” according
to the following rule: Two nodes can have the same color if there is no arc that connects
them. Finally, we choose one perturbation vector corresponding to each color: If nodes
$i_1, i_2, \ldots, i_\ell$ have the same color, the corresponding $p$ is $\epsilon(e_{i_1} + e_{i_2} + \cdots + e_{i_\ell})$.

Figure 8.1 Column incidence graph for $r(x)$ defined in (8.13).

Usually,  there  are  many  ways  to  assign  colors  to  the  n  nodes  in  the  graph  in  a  way  that 
satisfies  the  required  condition.  The  simplest  way  is  just  to  assign  each  node  a  different  color, 
but  since  that  scheme  produces  n  perturbation  vectors,  it  is  usually  not  the  most  efficient 
approach.  It  is  generally  very  difficult  to  find  the  coloring  scheme  that  uses  the  fewest 
possible  colors,  but  there  are  simple  algorithms  that  do  a  good  job  of  finding  a  near-optimal 
coloring at low cost. Curtis, Powell, and Reid [83] and Coleman and Moré [68] provide
descriptions of some methods and performance comparisons. Newsam and Ramsdell [227]
show that by considering a more general class of perturbation vectors $p$, it is possible
to evaluate the full Jacobian using no more than $n_z$ evaluations of $r$ (in addition to the
evaluation at the point $x$), where $n_z$ is the maximum number of nonzeros in each row
of $J(x)$.

For  some  functions  r  with  well-studied  structures  (those  that  arise  from  discretizations 
of  differential  operators,  or  those  that  give  rise  to  banded  Jacobians,  as  in  the  example  above), 
optimal coloring schemes are known. For the tridiagonal Jacobian of (8.14) and its associated
graph  in  Figure  8.1,  the  scheme  with  three  colors  is  optimal. 
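The coloring rule can be illustrated with a simple greedy heuristic: visit the nodes in their natural order and give each one the smallest color not already used by a neighbor. This is only a sketch under that assumption; it is not claimed to be any of the specific algorithms of [83] or [68], and the function name and the representation of the sparsity pattern (a list of sets of 0-based column indices) are choices made for this example.

```python
def column_groups(rows, n):
    """Greedily color the column incidence graph of a sparsity pattern.

    rows[j] is the set of (0-based) column indices that are nonzero in row j.
    Returns a list of groups; the columns in one group share a color and can
    therefore be perturbed together.
    """
    neighbors = [set() for _ in range(n)]
    for cols in rows:
        for i in cols:
            # Columns appearing in the same row are joined by an arc.
            neighbors[i].update(c for c in cols if c != i)

    color = [None] * n
    for i in range(n):                     # natural ordering of the nodes
        used = {color[k] for k in neighbors[i] if color[k] is not None}
        c = 0
        while c in used:                   # smallest color unused by any neighbor
            c += 1
        color[i] = c

    groups = [[] for _ in range(max(color) + 1)]
    for i, c in enumerate(color):
        groups[c].append(i)
    return groups


# Tridiagonal pattern (8.14) with n = 6, written with 0-based indices.
n = 6
pattern = [{j - 1, j, j + 1} & set(range(n)) for j in range(n)]
groups = column_groups(pattern, n)
print(groups)   # [[0, 3], [1, 4], [2, 5]]
```

For the tridiagonal pattern the greedy ordering happens to reproduce the three-color scheme of the text; for general patterns a greedy coloring is valid but need not use the fewest possible colors.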

APPROXIMATING  THE  HESSIAN 

In some situations, the user may be able to provide a routine to calculate the gradient
$\nabla f(x)$ but not the Hessian $\nabla^2 f(x)$. We can obtain the Hessian by applying the techniques
described above for the vector function $r$ to the gradient $\nabla f$. By using the graph coloring
techniques discussed above, sparse Hessians often can be approximated in this manner by
using considerably fewer than $n$ perturbation vectors. This approach ignores symmetry
of the Hessian, and will usually produce a nonsymmetric approximation. We can recover
symmetry by adding the approximation to its transpose and dividing the result by 2.
Alternative differencing approaches that take symmetry of $\nabla^2 f(x)$ explicitly into account
are discussed below.

Some important algorithms, most notably the Newton-CG methods described in
Chapter 7, do not require knowledge of the full Hessian. Instead, each iteration requires
us to supply the Hessian-vector product $\nabla^2 f(x) p$, for a given vector $p$. We can obtain an
approximation to this matrix-vector product by appealing once again to Taylor’s theorem.
When second derivatives of $f$ exist and are Lipschitz continuous near $x$, we have

$$\nabla f(x + \epsilon p) = \nabla f(x) + \epsilon \nabla^2 f(x) p + O(\epsilon^2), \tag{8.19}$$

so that

$$\nabla^2 f(x) p \approx \frac{\nabla f(x + \epsilon p) - \nabla f(x)}{\epsilon} \tag{8.20}$$

(see also (7.10)). The approximation error is $O(\epsilon)$, and the cost of obtaining the
approximation is a single gradient evaluation at the point $x + \epsilon p$. The formula (8.20) corresponds
to the forward-difference approximation (8.1). A central-difference formula like (8.7) can
be derived by evaluating $\nabla f(x - \epsilon p)$ as well.
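Formula (8.20) translates directly into code when a gradient routine is available. The sketch below assumes a user-supplied function `grad`; the example gradient `grad_example` and the step `eps=1e-7` are hypothetical choices made for illustration.

```python
def hessian_vector_product(grad, x, p, eps=1e-7):
    """Approximate the Hessian-vector product by the gradient difference (8.20)."""
    g0 = grad(x)
    g1 = grad([xi + eps * pi for xi, pi in zip(x, p)])   # gradient at x + eps * p
    return [(a - b) / eps for a, b in zip(g1, g0)]


def grad_example(x):
    # Gradient of the hypothetical function f(x) = x0^2 * x1 + x1^3,
    # whose Hessian is [[2*x1, 2*x0], [2*x0, 6*x1]].
    return [2.0 * x[0] * x[1], x[0] ** 2 + 3.0 * x[1] ** 2]


hp = hessian_vector_product(grad_example, [1.0, 2.0], [1.0, -1.0])
print(hp)   # close to [2.0, -10.0]
```

If the gradient at $x$ is already known, as it typically is inside a Newton-CG iteration, only the one evaluation at the shifted point is new work.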

For the case in which even gradients are not available, we can use Taylor’s theorem
once again to derive formulae for approximating the Hessian that use only function values.
The main tool is the formula (8.8): By substituting the vectors $p = \epsilon e_i$, $p = \epsilon e_j$, and
$p = \epsilon(e_i + e_j)$ into this formula and combining the results appropriately, we obtain

$$\frac{\partial^2 f}{\partial x_i \partial x_j}(x) = \frac{f(x + \epsilon e_i + \epsilon e_j) - f(x + \epsilon e_i) - f(x + \epsilon e_j) + f(x)}{\epsilon^2} + O(\epsilon). \tag{8.21}$$

If we wished to approximate every element of the Hessian with this formula, then we would
need to evaluate $f$ at $x + \epsilon(e_i + e_j)$ for all possible $i$ and $j$ (a total of $n(n + 1)/2$ points)
as well as at the $n$ points $x + \epsilon e_i$, $i = 1, 2, \ldots, n$. If the Hessian is sparse, we can, of course,
reduce this operation count by skipping the evaluation whenever we know the element
$\partial^2 f / \partial x_i \partial x_j$ to be zero.
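A dense version of the function-value formula (8.21) can be sketched as below. The step `eps=1e-4` is an assumption of this example (the text does not prescribe a value for this formula), as is the test function `f_example`.

```python
def fd_hessian(f, x, eps=1e-4):
    """Approximate the full Hessian of f at x from function values via (8.21)."""
    n = len(x)
    fx = f(x)

    def shifted(*indices):
        # Return x with eps added to each listed component (repeats add twice).
        y = list(x)
        for i in indices:
            y[i] += eps
        return y

    f1 = [f(shifted(i)) for i in range(n)]          # values f(x + eps * e_i)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                        # upper triangle only
            fij = f(shifted(i, j))                   # f(x + eps * e_i + eps * e_j)
            H[i][j] = H[j][i] = (fij - f1[i] - f1[j] + fx) / eps ** 2
    return H


def f_example(x):
    # Hypothetical test function with Hessian [[2*x1, 2*x0], [2*x0, 6*x1]].
    return x[0] ** 2 * x[1] + x[1] ** 3


H = fd_hessian(f_example, [1.0, 2.0])
print(H)   # close to [[4.0, 2.0], [2.0, 12.0]]
```

The loop visits only the upper triangle and mirrors each estimate, which matches the count of $n(n+1)/2$ doubly perturbed points given above. A sparse variant would simply skip the pairs known to give a zero element.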


APPROXIMATING  A  SPARSE  HESSIAN 

We noted above that a Hessian approximation can be obtained by applying
finite-difference Jacobian estimation techniques to the gradient $\nabla f$, treated as a vector function.
We now show how symmetry of the Hessian $\nabla^2 f$ can be used to reduce the number of
perturbation vectors $p$ needed to obtain a complete approximation, when the Hessian
is sparse. The key observation is that, because of symmetry, any estimate of the element
$[\nabla^2 f(x)]_{i,j} = \partial^2 f(x) / \partial x_i \partial x_j$ is also an estimate of its symmetric counterpart $[\nabla^2 f(x)]_{j,i}$.

We illustrate the point with the simple function $f : \mathbb{R}^n \to \mathbb{R}$ defined by

$$f(x) = x_1 \sum_{i=1}^{n} i^2 x_i^2. \tag{8.22}$$

It is easy to show that the Hessian $\nabla^2 f$ has the “arrowhead” structure depicted below, for
the case of $n = 6$:

$$\begin{bmatrix} \times & \times & \times & \times & \times & \times \\ \times & \times & & & & \\ \times & & \times & & & \\ \times & & & \times & & \\ \times & & & & \times & \\ \times & & & & & \times \end{bmatrix}. \tag{8.23}$$


If we were to construct the intersection graph for the function $\nabla f$ (analogous to Figure 8.1),
we would find that every node is connected to every other node, for the simple reason that
row 1 has a nonzero in every column. According to the rule for coloring the graph, then, we
would have to assign a different color to every node, which implies that we would need to
evaluate $\nabla f$ at the $n + 1$ points $x$ and $x + \epsilon e_i$ for $i = 1, 2, \ldots, n$.

We can construct a much more efficient scheme by taking the symmetry into account.
Suppose we first use the perturbation vector $p = \epsilon e_1$ to estimate the first column of $\nabla^2 f(x)$.
Because of symmetry, the same estimates apply to the first row of $\nabla^2 f$. From (8.23), we see
that all that remains is to find the diagonal elements $\nabla^2 f(x)_{22}, \nabla^2 f(x)_{33}, \ldots, \nabla^2 f(x)_{66}$.
The intersection graph for these remaining elements is completely disconnected, so we can
assign them all the same color and choose the corresponding perturbation vector to be

$$p = \epsilon(e_2 + e_3 + \cdots + e_6) = \epsilon(0, 1, 1, 1, 1, 1)^T. \tag{8.24}$$

Note that the second component of $\nabla f$ is not affected by the perturbations in components
3, 4, 5, 6 of the unknown vector, while the third component of $\nabla f$ is not affected by
perturbations in components 2, 4, 5, 6 of $x$, and so on. As in (8.15) and (8.16), we have for
each component $i = 2, 3, \ldots, 6$ that

$$\nabla f(x + p)_i = \nabla f(x + \epsilon(e_2 + e_3 + \cdots + e_6))_i = \nabla f(x + \epsilon e_i)_i.$$

By applying the forward-difference formula (8.1) to each of these individual components,
we then obtain

$$\frac{\partial^2 f}{\partial x_i^2}(x) \approx \frac{\nabla f(x + \epsilon e_i)_i - \nabla f(x)_i}{\epsilon} = \frac{\nabla f(x + p)_i - \nabla f(x)_i}{\epsilon}, \quad i = 2, 3, \ldots, 6.$$

By exploiting symmetry, we have been able to estimate the entire Hessian by evaluating $\nabla f$
only at $x$ and two other points.

Again, graph-coloring techniques can be used to choose the perturbation vectors $p$
economically. We use the adjacency graph in place of the intersection graph described earlier.
The adjacency graph has $n$ nodes, with arcs connecting nodes $i$ and $k$ whenever $i \ne k$ and
$\partial^2 f(x)/(\partial x_i \partial x_k) \ne 0$ for some $x$. The requirements on the coloring scheme are a little more
complicated than before, however. We require not only that connected nodes have different
colors, but also that any path of length 3 through the graph contain at least three colors. In
other words, if there exist nodes $i_1, i_2, i_3, i_4$ in the graph that are connected by arcs $(i_1, i_2)$,
$(i_2, i_3)$, and $(i_3, i_4)$, then at least three different colors must be used in coloring these four
nodes. See Coleman and Moré [69] for an explanation of this rule and for algorithms to
compute valid colorings. The perturbation vectors are constructed as before: Whenever the
nodes $i_1, i_2, \ldots, i_\ell$ have the same color, we set the corresponding perturbation vector to be
$p = \epsilon(e_{i_1} + e_{i_2} + \cdots + e_{i_\ell})$.
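The two-perturbation scheme for an arrowhead Hessian can be sketched in a few lines of NumPy. The test function $f(x) = x_1 \sum_i i^2 x_i^2$ and the names `grad` and `arrowhead_hessian_fd` are assumptions chosen for illustration; the function is used only because its Hessian has nonzeros in the first row, the first column, and the diagonal.

```python
import numpy as np

def grad(x):
    """Gradient of the illustrative function f(x) = x_1 * sum_i i^2 x_i^2."""
    i2 = np.arange(1, x.size + 1) ** 2
    g = 2.0 * i2 * x[0] * x                      # entries 2 i^2 x_1 x_i for i >= 2
    g[0] = np.sum(i2 * x**2) + 2.0 * x[0] ** 2   # first entry is different
    return g

def arrowhead_hessian_fd(grad, x, eps=1e-6):
    """Estimate an arrowhead Hessian from gradients at x and two other points."""
    n = x.size
    g0 = grad(x)
    H = np.zeros((n, n))

    # First perturbation p = eps * e_1 gives column 1 and, by symmetry, row 1.
    p = np.zeros(n)
    p[0] = eps
    first = (grad(x + p) - g0) / eps
    H[:, 0] = first
    H[0, :] = first

    # Second perturbation p = eps * (e_2 + ... + e_n), as in (8.24),
    # gives all remaining diagonal elements at once.
    p = np.full(n, eps)
    p[0] = 0.0
    diag = (grad(x + p) - g0) / eps
    idx = np.arange(1, n)
    H[idx, idx] = diag[1:]
    return H
```

Only three gradient evaluations are made regardless of $n$, whereas the column-by-column approach would need $n + 1$.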


8.2  AUTOMATIC  DIFFERENTIATION 


Automatic  differentiation  is  the  generic  name  for  techniques  that  use  the  computational 
representation  of  a  function  to  produce  analytic  values  for  the  derivatives.  Some  techniques 
produce  code  for  the  derivatives  at  a  general  point  x  by  manipulating  the  function  code 
directly.  Other  techniques  keep  a  record  of  the  computations  made  during  the  evaluation 
of  the  function  at  a  specific  point  x  and  then  review  this  information  to  produce  a  set  of 
derivatives  at  x. 

Automatic differentiation techniques are founded on the observation that any function,
no matter how complicated, is evaluated by performing a sequence of simple elementary
operations involving just one or two arguments at a time. Two-argument operations include
addition, multiplication, division, and the power operation $a^b$. Examples of single-argument
operations include the trigonometric, exponential, and logarithmic functions. Another common
ingredient of the various automatic differentiation tools is their use of the chain rule.
This is the well-known rule from elementary calculus that says that if $h$ is a function of the
vector $y \in \mathbb{R}^m$, which is in turn a function of the vector $x \in \mathbb{R}^n$, we can write the derivative
of $h$ with respect to $x$ as follows:

$$\nabla_x h(y(x)) = \sum_{i=1}^{m} \frac{\partial h}{\partial y_i} \nabla y_i(x). \tag{8.25}$$


See  Appendix  A  for  further  details. 

There are two basic modes of automatic differentiation: the forward and reverse modes.
The difference between them can be illustrated by a simple example. We work through such
an example below, and indicate how the techniques can be extended to general functions,
including vector functions.

AN  EXAMPLE 

Consider the following function of 3 variables:

$$f(x) = (x_1 x_2 \sin x_3 + e^{x_1 x_2})/x_3. \tag{8.26}$$

Figure 8.2 shows how the evaluation of this function can be broken down into its elementary
operations and also indicates the partial ordering associated with these operations. For
instance, the multiplication $x_1 * x_2$ must take place prior to the exponentiation $e^{x_1 x_2}$, or else
we would obtain the incorrect result $(e^{x_1}) x_2$. This graph introduces the intermediate variables
$x_4, x_5, \ldots$ that contain the results of intermediate computations; they are distinguished from
the independent variables $x_1, x_2, x_3$ that appear at the left of the graph. We can express the
evaluation of $f$ in arithmetic terms as follows:

$$\begin{aligned}
x_4 &= x_1 * x_2, \\
x_5 &= \sin x_3, \\
x_6 &= e^{x_4}, \\
x_7 &= x_4 * x_5, \\
x_8 &= x_6 + x_7, \\
x_9 &= x_8 / x_3.
\end{aligned} \tag{8.27}$$


The final node $x_9$ in Figure 8.2 contains the function value $f(x)$. In the terminology
of graph theory, node $i$ is the parent of node $j$, and node $j$ the child of node $i$, whenever
there is a directed arc from $i$ to $j$. Any node can be evaluated when the values of all its
parents are known, so computation flows through the graph from left to right. Flow of
computation in this direction is known as a forward sweep. It is important to emphasize
that software tools for automatic differentiation do not require the user to break down the
code for evaluating the function into its elements, as in (8.27). Identification of intermediate
quantities and construction of the computational graph is carried out, explicitly or implicitly,
by the software tool itself.

Figure 8.2 Computational graph for $f(x)$ defined in (8.26).


THE  FORWARD  MODE 

In the forward mode of automatic differentiation, we evaluate and carry forward
a directional derivative of each intermediate variable $x_i$ in a given direction $p \in \mathbb{R}^n$,
simultaneously with the evaluation of $x_i$ itself. For the three-variable example above, we use
the following notation for the directional derivative for $p$ associated with each variable:

$$D_p x_i = (\nabla x_i)^T p = \sum_{j=1}^{3} \frac{\partial x_i}{\partial x_j} p_j, \quad i = 1, 2, \ldots, 9, \tag{8.28}$$

where $\nabla$ indicates the gradient with respect to the three independent variables. Our goal
is to evaluate $D_p x_9$, which is the same as the directional derivative $\nabla f(x)^T p$. We note
immediately that initial values $D_p x_i$ for the independent variables $x_i$, $i = 1, 2, 3$, are simply
the components $p_1, p_2, p_3$ of $p$. The direction $p$ is referred to as the seed vector.

As soon as the value of $x_i$ at any node is known, we can find the corresponding value
of $D_p x_i$ from the chain rule. For instance, suppose we know the values of $x_4$, $D_p x_4$, $x_5$, and
$D_p x_5$, and we are about to calculate $x_7$ in Figure 8.2. We have that $x_7 = x_4 x_5$; that is, $x_7$
is a function of the two variables $x_4$ and $x_5$, which in turn are functions of $x_1, x_2, x_3$. By
applying the rule (8.25), we have that

$$\nabla x_7 = \frac{\partial x_7}{\partial x_4} \nabla x_4 + \frac{\partial x_7}{\partial x_5} \nabla x_5 = x_5 \nabla x_4 + x_4 \nabla x_5.$$

By taking the inner product of both sides of this expression with $p$ and applying the definition
(8.28), we obtain

$$D_p x_7 = \frac{\partial x_7}{\partial x_4} D_p x_4 + \frac{\partial x_7}{\partial x_5} D_p x_5 = x_5 D_p x_4 + x_4 D_p x_5. \tag{8.29}$$

The directional derivatives $D_p x_i$ are therefore evaluated side by side with the intermediate
results $x_i$, and at the end of the process we obtain $D_p x_9 = D_p f = \nabla f(x)^T p$.
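To make the side-by-side evaluation concrete, the following sketch writes out one forward sweep through (8.27) by hand, pairing each $x_i$ with its $D_p x_i$. The function name `forward_sweep` and the variable names `d1`, ..., `d9` (standing for $D_p x_1, \ldots, D_p x_9$) are assumptions made for this illustration; an automatic differentiation tool would generate the equivalent of these lines itself.

```python
import math

def forward_sweep(x1, x2, x3, p):
    """Evaluate f from (8.26) together with D_p f = grad f(x)^T p."""
    # Seed: D_p x_i = p_i for the three independent variables.
    d1, d2, d3 = p

    x4 = x1 * x2
    d4 = x2 * d1 + x1 * d2

    x5 = math.sin(x3)
    d5 = math.cos(x3) * d3

    x6 = math.exp(x4)
    d6 = x6 * d4

    x7 = x4 * x5
    d7 = x5 * d4 + x4 * d5           # this is (8.29)

    x8 = x6 + x7
    d8 = d6 + d7

    x9 = x8 / x3
    d9 = d8 / x3 - (x8 / x3**2) * d3

    return x9, d9
```

Calling `forward_sweep(1.0, 2.0, math.pi / 2, (1.0, 0.0, 0.0))` uses the seed vector $p = e_1$, so the second returned value is $\partial f/\partial x_1$ at that point.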

The principle of the forward mode is straightforward enough, but what of its practical
implementation and computational requirements? First, we repeat that the user does not
need to construct the computational graph, break the computation down into elementary
operations as in (8.27), or identify intermediate variables. The automatic differentiation
software should perform these tasks implicitly and automatically. Nor is it necessary to store
the information $x_i$ and $D_p x_i$ for every node of the computation graph at once (which is just
as well, since this graph can be very large for complicated functions). Once all the children
of any node have been evaluated, its associated values $x_i$ and $D_p x_i$ are not needed further
and may be overwritten in storage.

The key to practical implementation is the side-by-side evaluation of $x_i$ and $D_p x_i$. The
automatic differentiation software associates a scalar $D_p w$ with any scalar $w$ that appears
in the evaluation code. Whenever $w$ is used in an arithmetic computation, the software
performs an associated operation (based on the chain rule) with the directional derivative $D_p w$.
For instance, if $w$ is combined in a division operation with another value $y$ to produce a
new value $z$, that is,

$$z \leftarrow \frac{w}{y},$$

we use $w$, $y$, $D_p w$, and $D_p y$ to evaluate the directional derivative $D_p z$ as follows:

$$D_p z \leftarrow \frac{1}{y} D_p w - \frac{w}{y^2} D_p y. \tag{8.30}$$


To obtain the complete gradient vector, we can carry out this procedure simultaneously
for the $n$ seed vectors $p = e_1, e_2, \ldots, e_n$. By the definition (8.28), we see that $p = e_j$ implies
that $D_p f = \partial f/\partial x_j$, $j = 1, 2, \ldots, n$. We note from the example (8.30) that the additional
cost of evaluating $f$ and $\nabla f$ (over the cost of evaluating $f$ alone) may be significant. In
this example, the single division operation on $w$ and $y$ needed to calculate $z$ gives rise
to approximately $2n$ multiplications and $n$ additions in the computation of the gradient
elements $D_{e_j} z$, $j = 1, 2, \ldots, n$. It is difficult to obtain an exact bound on the increase in
computation, since the costs of retrieving and storing the data should also be taken into
account. The storage requirements may also increase by a factor as large as $n$, since we
now have to store $n$ additional scalars $D_{e_j} x_i$, $j = 1, 2, \ldots, n$, alongside each intermediate
variable $x_i$. It is usually possible to make savings by observing that many of these quantities
are zero, particularly in the early stages of the computation (that is, toward the left of the
computational graph), so sparse data structures can be used to store the vectors $D_{e_j} x_i$,
$j = 1, 2, \ldots, n$ (see [27]).

The  forward  mode  of  automatic  differentiation  can  be  implemented  by  means  of  a 
precompiler,  which  transforms  function  evaluation  code  into  extended  code  that  evaluates 
the  derivative  vectors  as  well.  An  alternative  approach  is  to  use  the  operator-overloading 
facilities  available  in  languages  such  as  C++  to  transparently  extend  the  data  structures  and 
operations  in  the  manner  described  above. 
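The operator-overloading approach can be sketched in Python instead of C++. The class `Dual` and the helpers `sin`, `exp`, and `gradient` below are hypothetical names introduced for this illustration, not part of any particular tool. Each `Dual` object pairs a value $w$ with its directional derivative $D_p w$, and the overloaded division implements rule (8.30). For simplicity the sketch performs one sweep per seed vector $e_j$ rather than carrying all $n$ derivatives simultaneously.

```python
import math

class Dual:
    """Pairs a value w with its directional derivative D_p w."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    other.val * self.dot + self.val * other.dot)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        # Rule (8.30): D_p z = (1/y) D_p w - (w / y^2) D_p y
        return Dual(self.val / other.val,
                    self.dot / other.val
                    - self.val / other.val**2 * other.dot)

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def exp(u):
    e = math.exp(u.val)
    return Dual(e, e * u.dot)

def f(x1, x2, x3):
    """The example function (8.26), written as ordinary-looking code."""
    return (x1 * x2 * sin(x3) + exp(x1 * x2)) / x3

def gradient(func, x):
    """One forward sweep per seed vector e_j; returns the list of D_{e_j} f."""
    grad = []
    for j in range(len(x)):
        args = [Dual(xi, 1.0 if i == j else 0.0) for i, xi in enumerate(x)]
        grad.append(func(*args).dot)
    return grad
```

Note that the body of `f` contains no derivative code at all; the extended arithmetic is supplied entirely by the overloaded operators.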


THE  REVERSE  MODE 

The reverse mode of automatic differentiation does not perform function and gradient
evaluations concurrently. Instead, after the evaluation of $f$ is complete, it recovers the partial
derivatives of $f$ with respect to each variable $x_i$ (independent and intermediate variables
alike) by performing a reverse sweep of the computational graph. At the conclusion of this
process, the gradient vector $\nabla f$ can be assembled from the partial derivatives $\partial f/\partial x_i$ with
respect to the independent variables $x_i$, $i = 1, 2, \ldots, n$.

Instead of the directional derivatives $D_p x_i$ used in the forward mode, the reverse mode
associates a scalar variable $\bar{x}_i$ with each node in the graph; information about the partial
derivative $\partial f/\partial x_i$ is accumulated in $\bar{x}_i$ during the reverse sweep. The $\bar{x}_i$ are sometimes
called the adjoint variables, and we initialize their values to zero, with the exception
of the rightmost node in the graph (node $N$, say), for which we set $\bar{x}_N = 1$. This
choice makes sense because $x_N$ contains the final function value $f$, so we have
$\partial f/\partial x_N = 1$.

The reverse sweep makes use of the following observation, which is again based on
the chain rule (8.25): For any node $i$, the partial derivative $\partial f/\partial x_i$ can be built up from
the partial derivatives $\partial f/\partial x_j$ corresponding to its child nodes $j$ according to the following
formula:

$$\frac{\partial f}{\partial x_i} = \sum_{j \text{ a child of } i} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}. \tag{8.31}$$

For each node $i$, we add the right-hand-side term in (8.31) to $\bar{x}_i$ as soon as it becomes
known; that is, we perform the operation

$$\bar{x}_i \mathrel{+}= \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}. \tag{8.32}$$

(In this expression and the ones below, we use the arithmetic notation of the programming
language C, in which $x \mathrel{+}= a$ means $x \leftarrow x + a$.) Once contributions have been received
from all the child nodes of $i$, we have $\bar{x}_i = \partial f/\partial x_i$, so we declare node $i$ to be "finalized."
At this point, node $i$ is ready to contribute a term to the summation for each of its parent
nodes according to the formula (8.31). The process continues in this fashion until all nodes
are finalized. Note that for derivative evaluation, the flow of computation in the graph
is from children to parents, the opposite direction to the computation flow for function
evaluation.

During the reverse sweep, we work with numerical values, not with formulae or
computer code involving the variables $x_i$ or the partial derivatives $\partial f/\partial x_i$. During the
forward sweep (the evaluation of $f$) we not only calculate the values of each variable $x_i$,
but we also calculate and store the numerical values of each partial derivative $\partial x_j/\partial x_i$. Each
of these partial derivatives is associated with a particular arc of the computational graph.
The numerical values of $\partial x_j/\partial x_i$ computed during the forward sweep are then used in the
formula (8.32) during the reverse sweep.

We illustrate the reverse mode for the example function (8.26). In Figure 8.3 we fill
in the graph of Figure 8.2 for a specific evaluation point $x = (1, 2, \pi/2)^T$, indicating the
numerical values of the intermediate variables $x_4, x_5, \ldots, x_9$ associated with each node and
the partial derivatives $\partial x_j/\partial x_i$ associated with each arc.

Figure 8.3 Computational graph for $f(x)$ defined in (8.26) showing numerical
values of intermediate values and partial derivatives for the point $x = (1, 2, \pi/2)^T$.
Notation: $p(j, i) = \partial x_j/\partial x_i$.

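As mentioned above, we initialize the reverse sweep by setting all the adjoint variables
$\bar{x}_i$ to zero, except for the rightmost node, for which we have $\bar{x}_9 = 1$. Since $f(x) = x_9$ and
since node 9 has no children, we have $\bar{x}_9 = \partial f/\partial x_9$, and so we can immediately declare node
9 to be finalized.

Node 9 is the child of nodes 3 and 8, so we use formula (8.32) to update the values of
$\bar{x}_3$ and $\bar{x}_8$ as follows:

$$\bar{x}_3 \mathrel{+}= \frac{\partial f}{\partial x_9} \frac{\partial x_9}{\partial x_3} = -\frac{2 + e^2}{(\pi/2)^2} = \frac{-8 - 4e^2}{\pi^2}, \tag{8.33a}$$

$$\bar{x}_8 \mathrel{+}= \frac{\partial f}{\partial x_9} \frac{\partial x_9}{\partial x_8} = \frac{1}{\pi/2} = \frac{2}{\pi}. \tag{8.33b}$$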


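Node 3 is not finalized after this operation; it still awaits a contribution from its other child,
node 5. On the other hand, node 9 is the only child of node 8, so we can declare node 8 to
be finalized with the value $\partial f/\partial x_8 = 2/\pi$. We can now update the values of $\bar{x}_i$ at the two parent
nodes of node 8 by applying the formula (8.32) once again; that is,

$$\bar{x}_6 \mathrel{+}= \frac{\partial f}{\partial x_8} \frac{\partial x_8}{\partial x_6} = \frac{2}{\pi}, \qquad
\bar{x}_7 \mathrel{+}= \frac{\partial f}{\partial x_8} \frac{\partial x_8}{\partial x_7} = \frac{2}{\pi}.$$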

At this point, nodes 6 and 7 are finalized, so we can use them to update nodes 4 and 5. At
the end of this process, when all nodes are finalized, nodes 1, 2, and 3 contain

$$\begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{bmatrix} = \nabla f(x) =
\begin{bmatrix} (4 + 4e^2)/\pi \\ (2 + 2e^2)/\pi \\ (-8 - 4e^2)/\pi^2 \end{bmatrix},$$

and the derivative computation is complete.
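The sweep just described can be replayed numerically. The sketch below is a hand-written illustration, with the hypothetical names `reverse_gradient`, `arcs`, and `xbar`: it stores the arc values $\partial x_j/\partial x_i$ during the forward sweep and then applies the update (8.32) to the arcs in reverse order. Because the nodes are numbered so that every child has a larger index than its parents, processing the arcs backwards guarantees that each node is finalized before it contributes to its parents.

```python
import math

def reverse_gradient(x1, x2, x3):
    """Value and gradient of f from (8.26): forward sweep, then reverse sweep."""
    # Forward sweep: node values keyed by node index, as in (8.27).
    x = {1: x1, 2: x2, 3: x3}
    x[4] = x[1] * x[2]
    x[5] = math.sin(x[3])
    x[6] = math.exp(x[4])
    x[7] = x[4] * x[5]
    x[8] = x[6] + x[7]
    x[9] = x[8] / x[3]

    # Arc values recorded during the forward sweep, as tuples
    # (child j, parent i, dx_j/dx_i), in the order the children were computed.
    arcs = [
        (4, 1, x[2]), (4, 2, x[1]),
        (5, 3, math.cos(x[3])),
        (6, 4, x[6]),
        (7, 4, x[5]), (7, 5, x[4]),
        (8, 6, 1.0), (8, 7, 1.0),
        (9, 8, 1.0 / x[3]), (9, 3, -x[8] / x[3] ** 2),
    ]

    # Reverse sweep: adjoints start at zero except for the output node.
    xbar = {i: 0.0 for i in x}
    xbar[9] = 1.0
    for j, i, dxj_dxi in reversed(arcs):
        xbar[i] += xbar[j] * dxj_dxi      # update (8.32)

    return x[9], [xbar[1], xbar[2], xbar[3]]
```

A single reverse sweep yields all three partial derivatives, in contrast to the forward mode, which needs one seed vector per independent variable.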

The main appeal of the reverse mode is that its computational complexity is low for
the scalar functions $f : \mathbb{R}^n \to \mathbb{R}$ discussed here. The extra arithmetic associated with
the gradient computation is at most four or five times the arithmetic needed to evaluate
the function alone. Taking the division operation in (8.33) as an example, we see that two
multiplications, a division, and an addition are required for (8.33a), while a division and
an addition are required for (8.33b). This is about five times as much work as the single
division involving these nodes that was performed during the forward sweep.

As we noted above, the forward mode may require up to $n$ times more arithmetic
to compute the gradient $\nabla f$ than to compute the function $f$ alone, making it appear
uncompetitive with the reverse mode. When we consider vector functions $r : \mathbb{R}^n \to \mathbb{R}^m$,
the relative costs of the forward and reverse modes become more similar as $m$ increases, as
we describe in the next section.

An apparent drawback of the reverse mode is the need to store the entire computational
graph, which is needed for the reverse sweep. In principle, storage of this graph is not too
difficult to implement. Whenever an elementary operation is performed, we can form and store
a new node containing the intermediate result, pointers to the (one or two) parent nodes, and
the partial derivatives associated with these arcs. During the reverse sweep, the nodes can be
read in the reverse order to that in which they were written, giving a particularly simple access
pattern. The process of forming and writing the graph can be implemented as a straightforward
extension to the elementary operations via operator overloading (as in ADOL-C [154]).
The reverse sweep/gradient evaluation can be invoked as a simple function call.

Unfortunately,  the  computational  graph  may  require  a  huge  amount  of  storage.  If  each 
node  can  be  stored  in  20  bytes,  then  a  function  that  requires  one  second  of  evaluation  time 
on  a  100  megaflop  computer  may  produce  a  graph  of  up  to  2  gigabytes  in  size.  The  storage 
requirements  can  be  reduced,  at  the  cost  of  some  extra  arithmetic,  by  performing  partial 
forward  and  reverse  sweeps  on  pieces  of  the  computational  graph,  reevaluating  portions  of 
the  graph  as  needed  rather  than  storing  the  whole  structure.  Descriptions  of  this  approach, 
sometimes  known  as  checkpointing,  can  be  found  in  Griewank  [150]  and  Grimm,  Pottier,  and 
Rostaing-Schmidt [157]. An implementation of checkpointing in the context of variational
data assimilation can be found in Restrepo, Leaf, and Griewank [264].

VECTOR  FUNCTIONS  AND  PARTIAL  SEPARABILITY 

So far, we have looked at automatic differentiation of general scalar-valued functions
$f : \mathbb{R}^n \to \mathbb{R}$. In nonlinear least-squares problems (Chapter 10) and nonlinear equations
(Chapter 11), we have to deal with vector functions $r : \mathbb{R}^n \to \mathbb{R}^m$ with $m$ components
$r_j$, $j = 1, 2, \ldots, m$. The rightmost column of the computational graph then consists of $m$
nodes, none of which has any children, in place of the single node described above. The
forward and reverse modes can be adapted in straightforward ways to find the Jacobian
$J(x)$, the $m \times n$ matrix defined in (8.9).

Besides their applications to least-squares and nonlinear-equations problems, automatic
differentiation of vector functions is a useful technique for dealing with partially
separable functions. We recall that partial separability is commonly observed in large-scale
optimization, and we saw in Chapter 7 that there exist efficient quasi-Newton procedures for
the minimization of objective functions with this property. Since an automatic procedure for
detecting the decomposition of a given function $f$ into its partially separable representation
was developed recently by Gay [118], it has become possible to exploit the efficiencies that
accrue from this property without asking much information from the user.

In the simplest sense, a function $f$ is partially separable if we can express it in the form

$$f(x) = \sum_{i=1}^{ne} f_i(x), \tag{8.34}$$

where each element function $f_i(\cdot)$ depends on just a few components of $x$. If we construct
the vector function $r$ from the partially separable components, that is,

$$r(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{ne}(x) \end{bmatrix},$$

it follows from (8.34) that

$$\nabla f(x) = J(x)^T e, \tag{8.35}$$

where, as usual, $e = (1, 1, \ldots, 1)^T$. Because of the partial separability property, most
columns of $J(x)$ contain just a few nonzeros. This structure makes it possible to calculate
$J(x)$ efficiently by applying graph-coloring techniques, as we discuss below. The gradient
$\nabla f(x)$ can then be recovered from the formula (8.35).
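Formula (8.35) is easy to check on a small partially separable function. The example below is an assumption made for illustration: it uses the element functions $f_i(x) = (x_i - x_{i+1}^2)^2$, $i = 1, \ldots, n - 1$, each of which depends on only two components of $x$, and the helper names are hypothetical. The Jacobian is coded by hand here; in practice it would come from automatic differentiation.

```python
import numpy as np

def element_values(x):
    """r(x): the element functions f_i(x) = (x_i - x_{i+1}^2)^2."""
    return (x[:-1] - x[1:] ** 2) ** 2

def element_jacobian(x):
    """J(x): row i holds the gradient of f_i, which has only two nonzeros."""
    n = x.size
    t = x[:-1] - x[1:] ** 2
    J = np.zeros((n - 1, n))
    rows = np.arange(n - 1)
    J[rows, rows] = 2.0 * t                   # d f_i / d x_i
    J[rows, rows + 1] = -4.0 * x[1:] * t      # d f_i / d x_{i+1}
    return J

def f(x):
    """The partially separable objective (8.34)."""
    return element_values(x).sum()

def grad_f(x):
    """Recover the gradient from the Jacobian via (8.35)."""
    e = np.ones(x.size - 1)
    return element_jacobian(x).T @ e
```

Each column of this Jacobian has at most two nonzeros, so a dense array is wasteful for large $n$; it is used here only to keep the sketch short.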

In constrained optimization, it is often beneficial to evaluate the objective function $f$
and the constraint functions $c_i$, $i \in \mathcal{I} \cup \mathcal{E}$, simultaneously. By doing so, we can take advantage
of common expressions (which show up as shared intermediate nodes in the computation
graph) and thus can reduce the total workload. In this case, the vector function $r$ can be
defined as

$$r(x) = \begin{bmatrix} f(x) \\ \left[ c_j(x) \right]_{j \in \mathcal{I} \cup \mathcal{E}} \end{bmatrix}.$$

An example of shared intermediate nodes was seen in Figure 8.2, where $x_4$ is shared during
the computation of $x_6$ and $x_7$.

CALCULATING  JACOBIANS  OF  VECTOR  FUNCTIONS 

The forward mode is the same for vector functions as for scalar functions. Given a
seed vector $p$, we continue to associate quantities $D_p x_i$ with the node that calculates each
intermediate variable $x_i$. At each of the rightmost nodes (containing $r_j$, $j = 1, 2, \ldots, m$),
this variable contains the quantity $D_p r_j = (\nabla r_j)^T p$, $j = 1, 2, \ldots, m$. By assembling
these $m$ quantities, we obtain $J(x)p$, the product of the Jacobian and our chosen vector
$p$. As in the case of scalar functions ($m = 1$), we can evaluate the complete Jacobian
by setting $p = e_1, e_2, \ldots, e_n$ and evaluating the $n$ quantities $D_{e_j} x_i$ simultaneously. For
sparse Jacobians, we can use the coloring techniques outlined above in the context of
finite-difference methods to make more intelligent and economical choices of the seed vectors $p$.
The factor of increase in cost of arithmetic, when compared to a single evaluation of $r$, is
about equal to the number of seed vectors used.

The key to applying the reverse mode to a vector function $r(x)$ is to choose seed
vectors $q \in \mathbb{R}^m$ and apply the reverse mode to the scalar functions $r(x)^T q$. The result of
this process is the vector

$$\nabla \left[ r(x)^T q \right] = \nabla \left[ \sum_{j=1}^{m} q_j r_j(x) \right] = J(x)^T q.$$

Instead of the Jacobian-vector product that we obtain with the forward mode, the reverse
mode yields a Jacobian-transpose-vector product. The technique can be implemented by
seeding the variables $\bar{x}_i$ in the $m$ dependent nodes that contain $r_1, r_2, \ldots, r_m$, with the
components $q_1, q_2, \ldots, q_m$ of the vector $q$. At the end of the reverse sweep, the nodes for the
independent variables $x_1, x_2, \ldots, x_n$ will contain

$$\frac{\partial}{\partial x_i} \left[ r(x)^T q \right], \quad i = 1, 2, \ldots, n,$$

which are simply the components of $J(x)^T q$.
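A minimal sketch of this seeding, using operator overloading to record the graph, is given below. The class `Var`, its class-level `tape` list, and the two-component function `r` are all assumptions made for illustration; they are not modeled on the interface of any existing package. Each new node stores its parents together with the arc values $\partial x_j/\partial x_i$, and the reverse sweep simply reads the tape backwards.

```python
import math

class Var:
    """A node of the computational graph recorded during the forward sweep."""
    tape = []                        # nodes in the order they were created

    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents       # tuples (parent node, d self / d parent)
        self.bar = 0.0               # adjoint variable
        Var.tape.append(self)

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val,
                   ((self, other.val), (other, self.val)))

def sin(u):
    return Var(math.sin(u.val), ((u, math.cos(u.val)),))

def exp(u):
    e = math.exp(u.val)
    return Var(e, ((u, e),))

def r(x1, x2, x3):
    """Illustrative vector function with m = 2 and a shared node t."""
    t = x1 * x2
    return [t + sin(x3), exp(t)]

def jacobian_transpose_times(func, x, q):
    """Return J(x)^T q using one forward sweep and one reverse sweep."""
    Var.tape = []
    xs = [Var(v) for v in x]
    outputs = func(*xs)
    for out, qj in zip(outputs, q):
        out.bar += qj                        # seed the dependent nodes with q
    for node in reversed(Var.tape):          # children are visited before parents
        for parent, partial in node.parents:
            parent.bar += node.bar * partial
    return [v.bar for v in xs]
```

Choosing $q = e_1$ and then $q = e_2$ would return the two rows of $J(x)$ in turn.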

As usual, we can obtain the full Jacobian by carrying out the process above for the $m$
unit vectors $q = e_1, e_2, \ldots, e_m$. Alternatively, for sparse Jacobians, we can apply the usual
coloring techniques to find a smaller number of seed vectors $q$; the only difference is
that the graphs and coloring strategies are defined with reference to the transpose $J(x)^T$
rather than to $J(x)$ itself. The factor of increase in the number of arithmetic operations
required, in comparison to an evaluation of $r$ alone, is no more than 5 times the number
of seed vectors. (The factor of 5 is the usual overhead from the reverse mode for a scalar
function.) The space required for storage of the computational graph is no greater than in
the scalar case. As before, we need only store the graph topology information together with
the partial derivative associated with each arc.

The  forward-  and  reverse-mode  techniques  can  be  combined  to  cumulatively  reveal 
all  the  elements  of  J(x).  We  can  choose  a  set  of  seed  vectors  p  for  the  forward  mode  to 
reveal  some  columns  of  J ,  then  perform  the  reverse  mode  with  another  set  of  seed  vectors 
q  to  reveal  the  rows  that  contain  the  remaining  elements. 

Finally, we note that for some algorithms, we do not need full knowledge of the
Jacobian $J(x)$. For instance, iterative methods such as the inexact Newton method for
nonlinear equations (see Section 11.1) require repeated calculation of $J(x)p$ for a succession
of vectors $p$. Each such matrix-vector product can be computed using the forward mode
by using a single forward sweep, at a similar cost to evaluation of the function alone.


CALCULATING  HESSIANS:  FORWARD  MODE 

So far, we have described how the forward and reverse modes can be applied to
obtain first derivatives of scalar and vector functions. We now outline extensions of these
techniques to the computation of the Hessian $\nabla^2 f$ of a scalar function $f$, and evaluation of
the Hessian-vector product $\nabla^2 f(x) p$ for a given vector $p$.

Recall that the forward mode makes use of the quantities $D_p x_i$, each of which stores
$(\nabla x_i)^T p$ for each node $i$ in the computational graph and a given vector $p$. For a given pair
of seed vectors $p$ and $q$ (both in $\mathbb{R}^n$) we now define another scalar quantity by

$$D_{pq} x_i = p^T (\nabla^2 x_i) q, \tag{8.36}$$

for each node $i$ in the computational graph. We can evaluate these quantities during the
forward sweep through the graph, alongside the function values $x_i$ and the first-derivative
values $D_p x_i$. The initial values of $D_{pq} x_i$ at the independent variable nodes $x_i$, $i = 1, 2, \ldots, n$,
will be 0, since the second derivatives of $x_i$ are zero at each of these nodes. When the forward
sweep is complete, the value of $D_{pq} x_i$ in the rightmost node of the graph will be $p^T \nabla^2 f(x) q$.

The formulae for transformation of the $D_{pq} x_i$ variables during the forward sweep
can once again be derived from the chain rule. For instance, if $x_i$ is obtained by adding the
values at its two parent nodes, $x_i = x_j + x_k$, the corresponding accumulation operations
on $D_p x_i$ and $D_{pq} x_i$ are as follows:

$$D_p x_i = D_p x_j + D_p x_k, \qquad D_{pq} x_i = D_{pq} x_j + D_{pq} x_k. \tag{8.37}$$

The other binary operations $-$, $\times$, $/$ are handled similarly. If $x_i$ is obtained by applying the
unary transformation $L$ to $x_j$, we have

$$x_i = L(x_j), \tag{8.38a}$$

$$D_p x_i = L'(x_j)(D_p x_j), \tag{8.38b}$$

$$D_{pq} x_i = L''(x_j)(D_p x_j)(D_q x_j) + L'(x_j) D_{pq} x_j. \tag{8.38c}$$

We see in (8.38c) that computation of $D_{pq} x_i$ can rely on the first-derivative quantities
$D_p x_i$ and $D_q x_i$, so both these quantities must be accumulated during the forward sweep
as well.
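The propagation rules (8.37) and (8.38) can be sketched with a small class that carries the four quantities $x_i$, $D_p x_i$, $D_q x_i$, $D_{pq} x_i$ through the sweep. The names `Jet`, `unary`, `recip`, and `hessian_entry` are hypothetical. The product rule shown in `__mul__` is not written out in the text; it follows from applying the chain rule twice. Division is expressed as multiplication by the unary reciprocal $L(u) = 1/u$.

```python
import math

class Jet:
    """Carries x_i, D_p x_i, D_q x_i and D_pq x_i for one node."""
    def __init__(self, val, dp=0.0, dq=0.0, dpq=0.0):
        self.val, self.dp, self.dq, self.dpq = val, dp, dq, dpq

    def __add__(self, o):                          # rule (8.37)
        return Jet(self.val + o.val, self.dp + o.dp,
                   self.dq + o.dq, self.dpq + o.dpq)

    def __mul__(self, o):                          # product rule applied twice
        return Jet(self.val * o.val,
                   self.dp * o.val + self.val * o.dp,
                   self.dq * o.val + self.val * o.dq,
                   self.dpq * o.val + self.dp * o.dq
                   + self.dq * o.dp + self.val * o.dpq)

def unary(u, L, L1, L2):
    """Rule (8.38), given the numerical values L(u), L'(u), L''(u)."""
    return Jet(L, L1 * u.dp, L1 * u.dq, L2 * u.dp * u.dq + L1 * u.dpq)

def sin(u):
    s, c = math.sin(u.val), math.cos(u.val)
    return unary(u, s, c, -s)

def exp(u):
    e = math.exp(u.val)
    return unary(u, e, e, e)

def recip(u):
    return unary(u, 1.0 / u.val, -1.0 / u.val**2, 2.0 / u.val**3)

def f(x1, x2, x3):
    """The example function (8.26), with division written as a reciprocal."""
    return (x1 * x2 * sin(x3) + exp(x1 * x2)) * recip(x3)

def hessian_entry(x, j, k):
    """e_j^T Hessian(f)(x) e_k from one forward sweep with p = e_j, q = e_k."""
    args = [Jet(v, float(i == j), float(i == k)) for i, v in enumerate(x)]
    return f(*args).dpq
```

The indices `j` and `k` are zero-based here, so `hessian_entry(x, 0, 2)` corresponds to $\partial^2 f/(\partial x_1 \partial x_3)$.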

We could compute a general dense Hessian by choosing the pairs $(p, q)$ to be all
possible pairs of unit vectors $(e_j, e_k)$, for $j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, j$, a total of
$n(n + 1)/2$ vector pairs. (Note that we need only evaluate the lower triangle of $\nabla^2 f(x)$,
because of symmetry.) When we know the sparsity structure of $\nabla^2 f(x)$, we need evaluate
$D_{e_j e_k} x_i$ only for the pairs $(e_j, e_k)$ for which the $(j, k)$ component of $\nabla^2 f(x)$ is possibly
nonzero.

The total increase factor for the number of arithmetic operations, compared with the
amount of arithmetic to evaluate $f$ alone, is a small multiple of $1 + n + N_z(\nabla^2 f)$, where
$N_z(\nabla^2 f)$ is the number of elements of $\nabla^2 f$ that we choose to evaluate. This number reflects
the evaluation of the quantities $x_i$, $D_{e_j} x_i$ ($j = 1, 2, \ldots, n$), and $D_{e_j e_k} x_i$ for the $N_z(\nabla^2 f)$
vector pairs $(e_j, e_k)$. The "small multiple" results from the fact that the update operations
for $D_p x_i$ and $D_{pq} x_i$ may require a few times more operations than the update operation
for $x_i$ alone; see, for example, (8.38). One storage location per node of the graph is required
for each of the $1 + n + N_z(\nabla^2 f)$ quantities that are accumulated, but recall that storage of
node $i$ can be overwritten once all its children have been evaluated.

When we do not need the complete Hessian, but only a matrix-vector product involving
the Hessian (as in the Newton-CG algorithm of Chapter 7), the amount of arithmetic
is, of course, smaller. Given a vector $q \in \mathbb{R}^n$, we use the techniques above to compute
the first-derivative quantities $D_{e_1} x_i, \ldots, D_{e_n} x_i$ and $D_q x_i$, as well as the second-derivative
quantities $D_{e_1 q} x_i, \ldots, D_{e_n q} x_i$, during the forward sweep. The final node will contain the
quantities

$$e_j^T \left( \nabla^2 f(x) \right) q = \left[ \nabla^2 f(x) q \right]_j, \quad j = 1, 2, \ldots, n,$$

which are the components of the vector $\nabla^2 f(x) q$. Since $2n + 1$ quantities in addition to
$x_i$ are being accumulated during the forward sweep, the number of arithmetic operations
grows by a factor that is a small multiple of $2n$.

An alternative technique for evaluating sparse Hessians is based on the forward-mode
propagation of first and second derivatives of univariate functions. To motivate this
approach, note that the $(i, j)$ element of the Hessian can be expressed as follows:

$$\left[ \nabla^2 f(x) \right]_{ij} = e_i^T \nabla^2 f(x) e_j
= \tfrac{1}{2} \left[ (e_i + e_j)^T \nabla^2 f(x) (e_i + e_j) - e_i^T \nabla^2 f(x) e_i - e_j^T \nabla^2 f(x) e_j \right]. \tag{8.39}$$

We can use this interpolation formula to evaluate $[\nabla^2 f(x)]_{ij}$, provided that the second
derivatives $D_{pp} x_k$, for $p = e_i$, $p = e_j$, $p = e_i + e_j$, and all nodes $x_k$, have been evaluated
during the forward sweep through the computational graph. In fact, we can evaluate all the
nonzero elements of the Hessian, provided that we use the forward mode to evaluate $D_p x_k$
and $D_{pp} x_k$ for a selection of vectors $p$ of the form $e_i + e_j$, where $i$ and $j$ are both indices
in $\{1, 2, \ldots, n\}$, possibly with $i = j$.

One advantage of this approach is that it is no longer necessary to propagate "cross
terms" of the form $D_{pq} x_i$ for $p \ne q$ (see, for example, (8.37) and (8.38c)). The propagation
formulae therefore simplify somewhat. Each $D_{pp} x_k$ is a function of $x_\ell$, $D_p x_\ell$, and $D_{pp} x_\ell$
for all parent nodes $\ell$ of node $k$.

Note, too, that if we define the univariate function $\psi$ by

$$ \psi(t) = f(x + tp), \qquad (8.40) $$

then the values of $D_p f$ and $D_{pp} f$, which emerge at the completion of the forward sweep,
are simply the first two derivatives of $\psi$ evaluated at $t = 0$; that is,

$$ D_p f = p^T \nabla f(x) = \psi'(t)|_{t=0}, \qquad D_{pp} f = p^T \nabla^2 f(x) p = \psi''(t)|_{t=0}. $$

Extension of this technique to third, fourth, and higher derivatives is possible.
Interpolation formulae analogous to (8.39) can be used in conjunction with higher derivatives
of the univariate functions $\psi$ defined in (8.40), again for a suitably chosen set of vectors $p$,
where each $p$ is made up of a sum of unit vectors $e_i$. For details, see Bischof, Corliss, and
Griewank [26].


CALCULATING  HESSIANS:  REVERSE  MODE 

We can also devise schemes based on the reverse mode for calculating Hessian-vector
products $\nabla^2 f(x) q$, or the full Hessian $\nabla^2 f(x)$. A scheme for obtaining $\nabla^2 f(x) q$
proceeds as follows. We start by using the forward mode to evaluate both $f$ and $\nabla f(x)^T q$,
by accumulating the two variables $x_i$ and $D_q x_i$ during the forward sweep in the manner
described above. We then apply the reverse mode in the normal fashion to the computed
function $\nabla f(x)^T q$. At the end of the reverse sweep, the nodes $i = 1, 2, \dots, n$ of the
computational graph that correspond to the independent variables will contain

$$ \frac{\partial}{\partial x_i} \left( \nabla f(x)^T q \right) = \left[ \nabla^2 f(x) q \right]_i, \quad i = 1, 2, \dots, n. $$

The number of arithmetic operations required to obtain $\nabla^2 f(x) q$ by this procedure
increases by only a modest factor, independent of $n$, over the evaluation of $f$ alone. By
the usual analysis for the forward mode, we see that the computation of $f$ and $\nabla f(x)^T q$
jointly requires a small multiple of the operation count for $f$ alone, while the reverse sweep
introduces a further factor of at most 5. The total increase factor is approximately 12 over the
evaluation of $f$ alone. If the entire Hessian $\nabla^2 f(x)$ is required, we could apply the procedure
just described with $q = e_1, e_2, \dots, e_n$. This approach would introduce an additional factor
of $n$ into the operation count, leading to an increase of at most $12n$ over the cost of $f$ alone.
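The scheme just described (a forward sweep that accumulates $x_i$ and $D_q x_i$, followed by a reverse sweep applied to $\nabla f(x)^T q$) can be sketched in a few lines of Python. The classes `Var` and `Dual`, the helper functions, and the test function are hypothetical names introduced for this illustration; only addition, multiplication, and the sine are supported.

```python
import math

class Var:
    """Node of a reverse-mode tape: value plus (parent, local partial) pairs."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def vsin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def vcos(a):
    return Var(math.cos(a.value), ((a, -math.sin(a.value)),))

def reverse_sweep(output):
    """Accumulate adjoints d(output)/d(node), visiting children before parents."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial

class Dual:
    """Forward-mode pair (x_i, D_q x_i); both entries are tape nodes."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dsin(a):
    return Dual(vsin(a.val), vcos(a.val) * a.dot)

def hessian_vector_product(f, x, q):
    xs = [Var(xi) for xi in x]
    out = f([Dual(xv, Var(qi)) for xv, qi in zip(xs, q)])
    reverse_sweep(out.dot)            # out.dot holds grad f(x)^T q
    return [xv.adjoint for xv in xs]  # components of (Hessian of f at x) q

# Hypothetical test function f(x) = x0 * x1 + sin(x0).
def f(x):
    return x[0] * x[1] + dsin(x[0])

hv = hessian_vector_product(f, [1.0, 2.0], [3.0, 4.0])
# Analytic result: (4 - 3 sin(1), 3)
```

The cost of the reverse sweep here is proportional to the number of tape nodes, not to $n$, which is the point of the operation count given above.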

Once again, when the Hessian is sparse with known structure, we may be able to
use graph-coloring techniques to evaluate this entire matrix using many fewer than $n$ seed
vectors. The choices of $q$ are similar to those used for finite-difference evaluation of the
Hessian, described above. The increase in operation count over evaluating $f$ alone is a
multiple of up to $12 N_c(\nabla^2 f)$, where $N_c$ is the number of seed vectors $q$ used in calculating
$\nabla^2 f$.


CURRENT  LIMITATIONS 

The current generation of automatic differentiation tools has proved its worth through
successful application to some large and difficult design optimization problems. However,
these tools can run into difficulties with some commonly used programming constructs and
some implementations of computer arithmetic. As an example, if the evaluation of $f(x)$
depends on the solution of a partial differential equation (PDE), then the computed value
of $f$ may contain truncation error arising from the finite-difference or the finite-element
technique that is used to solve the PDE numerically. That is, we have $\hat f(x) = f(x) + r(x)$,
where $\hat f(\cdot)$ is the computed value of $f(\cdot)$ and $r(\cdot)$ is the truncation error. Though $|r(x)|$ is
usually small, its derivative $r'(x)$ may not be, so the error in the computed derivative $\hat f'(x)$
is potentially large. (The finite-difference approximation techniques discussed in Section 8.1
experience the same difficulty.) Similar problems arise when the computer uses piecewise
rational functions to approximate trigonometric functions.

Another source of potential difficulty is the presence of branching in the code to
improve the speed or accuracy of function evaluation in certain domains. A pathological
example is provided by the linear function $f(x) = x - 1$. If we used the following (perverse,
but valid) piece of code to evaluate this function,

    if (x = 1.0) then f = 0.0 else f = x - 1.0,

then by applying automatic differentiation to this procedure we would obtain the derivative
value $f'(1) = 0$, whereas the true derivative is $f'(1) = 1$. For a discussion of such issues
and an approach to dealing with them, see Griewank [151, 152].
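The branching difficulty is easy to reproduce with a minimal forward-mode sketch in Python. The class `Dual` and the function names are hypothetical and stand in for whatever mechanism a real tool uses; the only operation needed is subtraction of a constant.

```python
class Dual:
    """Forward-mode pair (value, derivative) for a function of one variable."""
    def __init__(self, value, deriv):
        self.value, self.deriv = value, deriv

    def __sub__(self, constant):
        # Subtracting a constant leaves the derivative unchanged.
        return Dual(self.value - constant, self.deriv)

def f_branch(x):
    """The perverse but valid evaluation of f(x) = x - 1."""
    if x.value == 1.0:
        return Dual(0.0, 0.0)   # constant branch: derivative information is lost
    return x - 1.0

def derivative(func, t):
    """Seed the input with derivative 1 and read the derivative of the output."""
    return func(Dual(t, 1.0)).deriv

d_at_1 = derivative(f_branch, 1.0)   # 0.0, although f'(1) = 1
d_at_2 = derivative(f_branch, 2.0)   # 1.0
```

The function values returned by `f_branch` are correct everywhere; only the derivative on the special branch is wrong, because the tool differentiates the statements that were executed, not the function they represent.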

In  conclusion,  automatic  differentiation  should  be  regarded  as  a  set  of  increasingly 
sophisticated  techniques  that  enhances  optimization  algorithms,  allowing  them  to  be  applied 
more  widely  to  practical  problems  involving  complicated  functions.  By  providing  sensitivity 
information,  it  helps  the  modeler  to  extract  more  information  from  the  results  of  the 
computation. Automatic differentiation should not be regarded as a panacea that absolves
the  user  altogether  from  the  responsibility  of  thinking  about  derivative  calculations. 


NOTES  AND  REFERENCES 

A comprehensive and authoritative reference on automatic differentiation is the book
of Griewank [152]. The web site www.autodiff.org contains a wealth of current
information about theory, software, and applications. A number of edited collections of papers
on automatic differentiation have appeared since 1991; see Griewank and Corliss [153],
Berz et al. [20], and Bücker et al. [40]. An historical paper of note is Corliss and Rall [78],
which includes an extensive bibliography. Software tool development in automatic
differentiation makes use not only of forward and reverse modes but also includes “mixed
modes” and “cross-country algorithms” that combine the two approaches; see for example
Naumann [222].

The field of automatic differentiation grew considerably during the 1990s, and a
number of good software tools appeared. These included ADIFOR [25], ADIC [28], and
ADOL-C [154]. Tools developed in more recent years include TAPENADE, which accepts
Fortran code through a web server and returns differentiated code; TAF, a commercial tool
that also performs source-to-source automatic differentiation of Fortran codes; OpenAD,
which works with Fortran, C, and C++; and TOMLAB/MAD, which works with MATLAB
code.

The technique for calculating the gradient of a partially separable function was
described by Bischof et al. [24], whereas the computation of the Hessian matrix has been
considered by several authors; see, for example, Gay [118].

The work of Coleman and Moré [69] on efficient estimation of Hessians was predated
by Powell and Toint [261], who did not use the language of graph coloring but nevertheless
devised highly effective schemes. Software for estimating sparse Hessians and Jacobians is
described by Coleman, Garbow, and Moré [66, 67]. The recent paper of Gebremedhin,
Manne, and Pothen [120] contains a comprehensive discussion of the application of graph
coloring to both finite-difference and automatic differentiation techniques.


EXERCISES

8.1 Show that a suitable value for the perturbation $\epsilon$ in the central-difference formula
is $\epsilon = u^{1/3}$, and that the accuracy achievable by this formula when the values of $f$ contain
roundoff errors of size $u$ is approximately $u^{2/3}$. (Use similar assumptions to the ones used
to derive the estimate (8.6) for the forward-difference formula.)

8.2 Derive a central-difference analogue of the Hessian-vector approximation
formula (8.20).

8.3 Verify the formula (8.21) for approximating an element of the Hessian using only
function values.

8.4 Verify that if the Hessian of a function $f$ has nonzero diagonal elements, then its
adjacency graph is a subgraph of the intersection graph for $\nabla f$. In other words, show that
any arc in the adjacency graph also belongs to the intersection graph.

8.5 Draw the adjacency graph for the function $f$ defined by (8.22). Show that the
coloring scheme in which node 1 has one color while nodes $2, 3, \dots, n$ have another color
is valid. Draw the intersection graph for $\nabla f$.

8.6 Construct the adjacency graph for the function whose Hessian has the nonzero
structure


X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

x  x 

x  x 


and  find  a  valid  coloring  scheme  with  just  four  colors. 

8.7 Trace the computations performed in the forward mode for the function $f(x)$ in
(8.26), expressing the intermediate derivatives $\nabla x_i$, $i = 4, 5, \dots, 9$ in terms of quantities
available at their parent nodes and then in terms of the independent variables $x_1$, $x_2$, $x_3$.

8.8 Formula (8.30) showed the gradient operations associated with scalar division.
Derive similar formulae for the following operations:

    $(s, t) \to s + t$    addition;
    $t \to e^t$    exponentiation;
    $t \to \tan(t)$    tangent;
    $(s, t) \to s^t$.

8.9 By calculating the partial derivatives $\partial x_j / \partial x_i$ for the function (8.26) from the
expressions (8.27), verify the numerical values for the arcs in Figure 8.3 for the evaluation
point $x = (1, 2, \pi/2)^T$. Work through the remaining details of the reverse sweep process,
indicating the order in which the nodes become finalized.

8.10 Using (8.33) as a guide, describe the reverse sweep operations corresponding to
the following elementary operations in the forward sweep:

    $x_k \leftarrow x_i x_j$    multiplication;
    $x_k \leftarrow \cos(x_i)$    cosine.

In each case, compare the arithmetic workload in the reverse sweep to the workload required
for the forward sweep.

8.11 Define formulae similar to (8.37) for accumulating the first derivatives $D_p x_i$ and
the second derivatives $D_{pq} x_i$ when $x_i$ is obtained from the following three binary operations:
$x_i = x_j - x_k$, $x_i = x_j x_k$, and $x_i = x_j / x_k$.

8.12 By using the definitions (8.28) of $D_p x_i$ and (8.36) of $D_{pq} x_i$, verify the
differentiation formulae (8.38) for the unitary operation $x_i = L(x_j)$.

8.13 Let $a \in \mathbb{R}^n$ be a fixed vector and define $f$ as $f(x) = \tfrac{1}{2} \left( x^T x + (a^T x)^2 \right)$. Count
the number of operations needed to evaluate $f$, $\nabla f$, $\nabla^2 f$, and the Hessian-vector product
$\nabla^2 f(x) p$ for an arbitrary vector $p$.


CHAPTER 9

Derivative-Free Optimization

Many practical applications require the optimization of functions whose derivatives are
not available. Problems of this kind can be solved, in principle, by approximating the
gradient (and possibly the Hessian) using finite differences (see Chapter 8), and using these
approximate gradients within the algorithms described in earlier chapters. Even though
this finite-difference approach is effective in some applications, it cannot be regarded as a
general-purpose technique for derivative-free optimization because the number of function
evaluations required can be excessive and the approach can be unreliable in the presence
of noise. (For the purposes of this chapter we define noise to be inaccuracy in the function
evaluation.) Because of these shortcomings, various algorithms have been developed that
do not attempt to approximate the gradient. Rather, they use the function values at a set of
sample points to determine a new iterate by some other means.

Derivative-free optimization (DFO) algorithms differ in the way they use the sampled
function values to determine the new iterate. One class of methods constructs a linear
or quadratic model of the objective function and defines the next iterate by seeking to
minimize this model inside a trust region. We pay particular attention to these model-based
approaches because they are related to the unconstrained minimization methods described
in earlier chapters. Other widely used DFO methods include the simplex-reflection method
of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated
annealing. In this chapter we briefly discuss these methods, with the exception of simulated
annealing, which is a nondeterministic approach and has little in common with the other
techniques discussed in this book.

Derivative-free optimization methods are not as well developed as gradient-based
methods; current algorithms are effective only for small problems. Although most DFO
methods have been adapted to handle simple types of constraints, such as bounds, the
efficient treatment of general constraints is still the subject of investigation. Consequently,
we limit our discussion to the unconstrained optimization problem

$$ \min_{x \in \mathbb{R}^n} f(x). \qquad (9.1) $$

Problems in which derivatives are not available arise often in practice. The evaluation
of $f(x)$ can, for example, be the result of an experimental measurement or a stochastic
simulation, with the underlying analytic form of $f$ unknown. Even if the objective function
$f$ is known in analytic form, coding its derivatives may be time consuming or impractical.
Automatic differentiation tools (Chapter 8) may not be applicable if $f(x)$ is provided only
in the form of binary computer code. Even when the source code is available, these tools
cannot be applied if the code is written in a combination of languages.

Methods for derivative-free optimization are often used (with mixed success) to
minimize nondifferentiable functions or to try to locate the global minimizer
of a function. Since we do not treat nonsmooth optimization or global optimization in this
book, we will restrict our attention to smooth problems in which $f$ has a continuous
derivative. We do, however, discuss the effects of noise in Sections 9.1 and 9.6.


9.1  FINITE  DIFFERENCES  AND  NOISE 


As  mentioned  above,  an  obvious  DFO  approach  is  to  estimate  the  gradient  by  using  finite 
differences  and  then  employ  a  gradient-based  method.  This  approach  is  sometimes  successful 
and  should  always  be  considered,  but  the  finite-difference  estimates  can  be  inaccurate  when 
the  objective  function  contains  noise.  We  quantify  the  effect  of  noise  in  this  section. 

Noise can arise in function evaluations for various reasons. If $f(x)$ depends on a
stochastic simulation, there will be a random error in the evaluated function because of the
finite number of trials in the simulation. When a differential equation solver or some other
complex numerical procedure is needed to calculate $f$, small but nonzero error tolerances
that are used during the calculations will produce noise in the value of $f$.

In many applications, then, the objective function $f$ has the form

$$ f(x) = h(x) + \phi(x), \qquad (9.2) $$

where $h$ is a smooth function and $\phi$ represents the noise. Note that we have written $\phi$ to be
a function of $x$ but in practice it need not be. For instance, if the evaluation of $f$ depends
on a simulation, the value of $\phi$ will generally differ at each evaluation, even at the same $x$.
The form (9.2) is, however, useful for illustrating some of the difficulties caused by noise in
gradient estimates and for developing algorithms for derivative-free optimization.

Given a difference interval $\epsilon$, recall that the centered finite-difference approximation
(8.7) to the gradient of $f$ at $x$ is defined as follows:

$$ \nabla_\epsilon f(x) = \left[ \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon} \right]_{i = 1, 2, \dots, n}, \qquad (9.3) $$

where $e_i$ is the $i$th unit vector (the vector whose only nonzero element is a 1 in the $i$th
position). We wish to relate $\nabla_\epsilon f(x)$ to the gradient of the underlying smooth function $h(x)$,
as a function of $\epsilon$ and the noise level. For this purpose we define the noise level $\eta$ to be the
largest value of $|\phi|$ in a box of edge length $2\epsilon$ centered at $x$, that is,

$$ \eta(x; \epsilon) = \sup_{\|z - x\|_\infty \le \epsilon} |\phi(z)|. \qquad (9.4) $$

By applying to the central difference formula (9.3) the argument that led to (8.5), we can
establish the following result.

Lemma 9.1.

Suppose that $\nabla^2 h$ is Lipschitz continuous in a neighborhood of the box
$\{ z \mid \|z - x\|_\infty \le \epsilon \}$ with Lipschitz constant $L_h$. Then we have

$$ \| \nabla_\epsilon f(x) - \nabla h(x) \|_\infty \le L_h \epsilon^2 + \frac{\eta(x; \epsilon)}{\epsilon}. \qquad (9.5) $$

Thus the error in the approximation (9.3) comes from both the intrinsic finite-difference
approximation error (the $O(\epsilon^2)$ term) and the noise (the $\eta(x; \epsilon)/\epsilon$ term). If the noise
dominates the difference interval $\epsilon$, we cannot expect any accuracy at all in $\nabla_\epsilon f(x)$, so it
will only be pure luck if $-\nabla_\epsilon f(x)$ turns out to be a direction of descent for $f$.
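The trade-off between the two error terms can be observed numerically. The following Python sketch is an illustration under stated assumptions: the smooth part $h(x) = e^x$, the deterministic oscillation used as "noise," and the noise level `ETA` are all hypothetical choices, and the problem is one-dimensional so that the gradient is a single derivative.

```python
import math

ETA = 1e-6   # assumed noise level: |phi(z)| <= ETA for every z

def h(x):
    """Smooth part of the objective."""
    return math.exp(x)

def phi(x):
    """Deterministic, rapidly oscillating stand-in for noise."""
    return ETA * math.sin(1.0e7 * x)

def f(x):
    return h(x) + phi(x)

def central_difference(func, x, eps):
    """One-variable version of the centered difference formula."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)

x = 0.3
results = []
for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    error = abs(central_difference(f, x, eps) - math.exp(x))   # h'(x) = exp(x)
    lipschitz = math.exp(x + eps)     # bound on |h'''| over [x - eps, x + eps]
    bound = lipschitz * eps ** 2 + ETA / eps
    results.append((eps, error, bound))
    print(f"eps={eps:.0e}  error={error:.3e}  bound={bound:.3e}")
```

For large $\epsilon$ the bound is dominated by the $L_h \epsilon^2$ term, and for small $\epsilon$ by $\eta/\epsilon$, which grows without limit as $\epsilon$ shrinks; once $\epsilon$ falls to the level of the noise, the bound no longer guarantees any accuracy.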

Instead of computing a tight cluster of function values around the current iterate, as
required by a finite-difference approximation to the gradient, it may be preferable to separate
these points more widely and use them to construct a model of the objective function. This
approach, which we consider in the next section and in Section 9.6, may be more robust to
the presence of noise.


9.2  MODEL-BASED  METHODS 


Some of the most effective algorithms for unconstrained optimization described in the
previous chapters compute steps by minimizing a quadratic model of the objective function
$f$. The model is formed by using function and derivative information at the current iterate.
When derivatives are not available, we may define the model $m_k$ as the quadratic function
that interpolates $f$ at a set of appropriately chosen sample points. Since such a model is
usually nonconvex, the model-based methods discussed in this chapter use a trust-region
approach to compute the step.

Suppose that at the current iterate $x_k$ we have a set of sample points
$Y = \{y^1, y^2, \dots, y^q\}$, with $y^i \in \mathbb{R}^n$, $i = 1, 2, \dots, q$. We assume that $x_k$ is an element of
this set and that no point in $Y$ has a lower function value than $x_k$. We wish to construct a
quadratic model of the form

$$ m_k(x_k + p) = c + g^T p + \tfrac{1}{2} p^T G p. \qquad (9.6) $$

We cannot define $g = \nabla f(x_k)$ and $G = \nabla^2 f(x_k)$ because these derivatives are not available.
Instead, we determine the scalar $c$, the vector $g \in \mathbb{R}^n$, and the symmetric matrix
$G \in \mathbb{R}^{n \times n}$ by imposing the interpolation conditions

$$ m_k(y^l) = f(y^l), \quad l = 1, 2, \dots, q. \qquad (9.7) $$

Since there are $\tfrac{1}{2}(n + 1)(n + 2)$ coefficients in the model (9.6) (that is, the components of
$c$, $g$, and $G$, taking into account the symmetry of $G$), the interpolation conditions (9.7)
determine $m_k$ uniquely only if

$$ q = \tfrac{1}{2}(n + 1)(n + 2). \qquad (9.8) $$

In this case, (9.7) can be written as a square linear system of equations in the coefficients of
the model. If we choose the interpolation points $y^1, y^2, \dots, y^q$ so that this linear system is
nonsingular, the model $m_k$ will be uniquely determined.

Once $m_k$ has been formed, we compute a step $p$ by approximately solving the
trust-region subproblem

$$ \min_p \; m_k(x_k + p), \quad \text{subject to } \|p\|_2 \le \Delta, \qquad (9.9) $$

for some trust-region radius $\Delta > 0$. We can use one of the techniques described in Chapter 4
to solve this subproblem. If $x_k + p$ gives a sufficient reduction in the objective function,
the new iterate is defined as $x_{k+1} = x_k + p$, the trust-region radius $\Delta$ is updated, and a
new iteration commences. Otherwise the step is rejected, and the interpolation set $Y$ may
be improved or the trust region shrunk.

To  reduce  the  cost  of  the  algorithm,  we  update  the  model  at  every  iteration, 
rather  than  recomputing  it  from  scratch.  In  practice,  we  choose  a  convenient  basis  for  the 
space  of  quadratic  polynomials,  the  most  common  choices  being  Lagrange  and  Newton 
polynomials.  The  properties  of  these  bases  can  be  used  both  to  measure  appropriateness  of 
the  sample  set  Y  and  to  change  this  set  if  necessary.  A  complete  algorithm  that  treats  all 
these  issues  effectively  is  far  more  complicated  than  the  quasi-Newton  methods  discussed 
in  Chapter  6.  Consequently,  we  will  provide  only  a  broad  outline  of  model-based  DFO 
methods. 

As is common in trust-region algorithms, the step-acceptance and trust-region update
strategies are based on the ratio between the actual reduction in the function and the
reduction predicted by the model, that is,

$$ \rho = \frac{f(x_k) - f(x_k^+)}{m_k(x_k) - m_k(x_k^+)}, \qquad (9.10) $$

where $x_k^+$ denotes the trial point. Throughout this section, the integer $q$ is defined
by (9.8).


Algorithm 9.1 (Model-Based Derivative-Free Method).

Choose an interpolation set $Y = \{y^1, y^2, \dots, y^q\}$ such that the linear system defined
by (9.7) is nonsingular, and select $x_0$ as a point in this set such that $f(x_0) \le f(y^i)$ for all
$y^i \in Y$. Choose an initial trust-region radius $\Delta_0$, a constant $\eta \in (0, 1)$, and set $k \leftarrow 0$.

repeat until a convergence test is satisfied:
    Form the quadratic model $m_k(x_k + p)$ that satisfies the interpolation
        conditions (9.7);
    Compute a step $p$ by approximately solving subproblem (9.9);
    Define the trial point as $x_k^+ = x_k + p$;
    Compute the ratio $\rho$ defined by (9.10);
    if $\rho \ge \eta$
        Replace an element of $Y$ by $x_k^+$;
        Choose $\Delta_{k+1} \ge \Delta_k$;
        Set $x_{k+1} \leftarrow x_k^+$;
        Set $k \leftarrow k + 1$ and go to the next iteration;
    else if the set $Y$ need not be improved
        Choose $\Delta_{k+1} < \Delta_k$;
        Set $x_{k+1} \leftarrow x_k$;
        Set $k \leftarrow k + 1$ and go to the next iteration;
    end (if)
    Invoke a geometry-improving procedure to update $Y$:
        at least one of the points in $Y$ is replaced by some other point,
        with the goal of improving the conditioning of (9.7);
    Set $\Delta_{k+1} \leftarrow \Delta_k$;
    Choose $\hat x$ as an element in $Y$ with lowest function value;
    Set $x_k^+ \leftarrow \hat x$ and recompute $\rho$ by (9.10);
    if $\rho \ge \eta$
        Set $x_{k+1} \leftarrow x_k^+$;
    else
        Set $x_{k+1} \leftarrow x_k$;
    end (if)
    Set $k \leftarrow k + 1$;
end (repeat)
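To make the flow of Algorithm 9.1 concrete, here is a deliberately simplified Python sketch for a function of one variable, so that $n = 1$ and $q = 3$. It is not the full algorithm: the function name, the parameter defaults, the radius factors 2 and 0.5, and the crude rule that stands in for the geometry-improving procedure (swap the rejected trial point in for the point farthest from the iterate when that tightens the set) are all assumptions made for this illustration. The three starting points are assumed to be distinct.

```python
def model_based_dfo_1d(f, points, delta=1.0, eta=0.1, tol=1e-8, max_iter=200):
    """One-variable sketch in the spirit of Algorithm 9.1 (n = 1, q = 3)."""
    Y = {float(y): f(float(y)) for y in points}      # point -> function value
    for _ in range(max_iter):
        if delta < tol:                              # simple convergence test
            break
        xk = min(Y, key=Y.get)                       # iterate: best point in Y
        a, b = (y for y in Y if y != xk)
        s1, s2 = a - xk, b - xk
        d1, d2 = Y[a] - Y[xk], Y[b] - Y[xk]
        # Interpolating model m(xk + p) = f(xk) + g*p + G*p^2/2.
        G = 2.0 * (d1 / s1 - d2 / s2) / (s1 - s2)
        g = d1 / s1 - 0.5 * G * s1
        # Trust-region subproblem: minimize g*p + G*p^2/2 over |p| <= delta.
        if G > 0.0:
            p = max(-delta, min(delta, -g / G))
        else:
            p = -delta if g > 0.0 else delta
        trial = xk + p
        predicted = -(g * p + 0.5 * G * p * p)       # m(xk) - m(trial)
        if trial in Y or predicted <= 0.0:
            delta *= 0.5                             # no usable step: shrink
            continue
        f_trial = f(trial)
        rho = (Y[xk] - f_trial) / predicted          # actual / predicted reduction
        if rho >= eta:
            # Accept: the trial point replaces the point farthest from it.
            outgoing = max(Y, key=lambda y: abs(y - trial))
            delta *= 2.0
        else:
            # Reject: shrink, and use the trial point to tighten Y if it helps.
            outgoing = max((y for y in Y if y != xk), key=lambda y: abs(y - xk))
            delta *= 0.5
            if abs(trial - xk) >= abs(outgoing - xk):
                continue
        del Y[outgoing]
        Y[trial] = f_trial
    best = min(Y, key=Y.get)
    return best, Y[best]

# Hypothetical test problem: a quadratic, for which the model is exact.
x_best, f_best = model_based_dfo_1d(lambda x: (x - 3.0) ** 2, [0.0, 1.0, 2.0])
```

On this quadratic the first trial step lands on the minimizer, after which the model predicts no further reduction and the radius shrinks until the convergence test is met. For general functions and $n > 1$, the model fit, the subproblem solver, and the geometry test all become substantially more involved.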


The case of $\rho \ge \eta$, in which we obtain sufficient reduction in the merit function, is
the simplest. In this case we always accept the trial point $x_k^+$ as the new iterate, include $x_k^+$
in $Y$, and remove an element from $Y$.

When sufficient reduction is not achieved ($\rho < \eta$), we look at two possible causes:
inadequacy of the interpolation set $Y$ and a trust region that is too large. The first cause
can arise when the iterates become restricted to a low-dimensional surface of $\mathbb{R}^n$ that does
not contain the solution. The algorithm could then be converging to a minimizer in this
subset. Behavior such as this can be detected by monitoring the conditioning of the linear
system defined by the interpolation conditions (9.7). If the condition number is too high,
we change $Y$ to improve it, typically by replacing one element of $Y$ with a new element
so as to move the interpolation system (9.7) as far away from singularity as possible. If $Y$
seems adequate, we simply decrease the trust-region radius $\Delta$, as is done in the methods of
Chapter 4.

A good initial choice for $Y$ is given by the vertices and the midpoints of the edges of a
simplex in $\mathbb{R}^n$.

The use of quadratic models limits the size of problems that can be solved in
practice. Performing $O(n^2)$ function evaluations just to start the algorithm is onerous,
even for moderate values of $n$ (say, $n = 50$). In addition, the cost of the iteration is
high. Even by updating the model $m_k$ at every iteration, rather than recomputing it
from scratch, the number of operations required to construct the model and compute a step
is $O(n^4)$ [257].

To alleviate these drawbacks, we can replace the quadratic model by a linear model in
which the matrix $G$ in (9.6) is set to zero. Since such a model contains only $n + 1$ parameters,
we need to retain only $n + 1$ interpolation points in the set $Y$, and the cost of each iteration is
$O(n^3)$. Algorithm 9.1 can be applied with little modification when the model is linear, but it
is not rapidly convergent because linear models cannot represent curvature of the problem.
Therefore, some model-based algorithms start with $n + 1$ initial points and compute steps
using a linear model, but after $q = \tfrac{1}{2}(n + 1)(n + 2)$ function values become available, they
switch to using quadratic models.
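The difference between the two start-up costs is simple arithmetic, which the following short Python sketch tabulates. The function names are hypothetical; the counts are $n + 1$ for a linear model and $\tfrac{1}{2}(n + 1)(n + 2)$ for a quadratic model.

```python
def linear_model_points(n):
    """Interpolation points needed by a linear model in n variables."""
    return n + 1

def quadratic_model_points(n):
    """Interpolation points needed by a full quadratic model: (n + 1)(n + 2) / 2."""
    return (n + 1) * (n + 2) // 2

for n in (2, 10, 50, 100):
    print(f"n={n:4d}  linear={linear_model_points(n):5d}  "
          f"quadratic={quadratic_model_points(n):6d}")
```

For $n = 50$, a quadratic model needs 1326 function values before the first step can be taken, compared with 51 for a linear model.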

INTERPOLATION  AND  POLYNOMIAL  BASES 

We now consider in more detail how to form a model of the objective function using
interpolation techniques. We begin by considering a linear model of the form

$$ m_k(x_k + p) = f(x_k) + g^T p. \qquad (9.11) $$

To determine the vector $g \in \mathbb{R}^n$, we impose the interpolation conditions $m_k(y^l) = f(y^l)$,
$l = 1, 2, \dots, n$, which can be written as

$$ (s^l)^T g = f(y^l) - f(x_k), \quad l = 1, 2, \dots, n, \qquad (9.12) $$

where

$$ s^l = y^l - x_k, \quad l = 1, 2, \dots, n. \qquad (9.13) $$

Conditions (9.12) represent a linear system of equations in which the rows of the coefficient
matrix are given by the vectors $(s^l)^T$. It follows that the model (9.11) is determined uniquely
by (9.12) if and only if the interpolation points $\{y^1, y^2, \dots, y^n\}$ are such that the set
$\{s^l : l = 1, 2, \dots, n\}$ is linearly independent. If this condition holds, the simplex formed by
the points $x_k, y^1, y^2, \dots, y^n$ is said to be nondegenerate.
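For $n = 2$ the system (9.12) is small enough to solve by Cramer's rule, which also makes the nondegeneracy condition visible as a nonzero determinant. The Python sketch below is an illustration only; the function name, the sample points, and the test function are hypothetical.

```python
def linear_model_gradient(f, xk, y1, y2):
    """Solve the 2 x 2 system (s^l)^T g = f(y^l) - f(xk), l = 1, 2, for g."""
    s1 = (y1[0] - xk[0], y1[1] - xk[1])
    s2 = (y2[0] - xk[0], y2[1] - xk[1])
    det = s1[0] * s2[1] - s1[1] * s2[0]
    if det == 0.0:
        raise ValueError("degenerate simplex: s^1 and s^2 are linearly dependent")
    r1 = f(y1) - f(xk)
    r2 = f(y2) - f(xk)
    # Cramer's rule for the system with rows s1 and s2.
    g0 = (r1 * s2[1] - s1[1] * r2) / det
    g1 = (s1[0] * r2 - r1 * s2[0]) / det
    return (g0, g1)

# Hypothetical linear test function; the model should recover its gradient (3, -2).
def f(x):
    return 3.0 * x[0] - 2.0 * x[1] + 1.0

g = linear_model_gradient(f, (0.0, 0.0), (1.0, 0.0), (0.5, 2.0))
```

If the three points lie on a line, for instance $(0, 0)$, $(1, 0)$, and $(2, 0)$, the determinant vanishes and the function values carry no information about the slope in the second coordinate.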

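Let us now consider how to construct a quadratic model of the form (9.6), with
$c = f(x_k)$. We rewrite the model as

$$ m_k(x_k + p) = f(x_k) + g^T p + \sum_{i < j} G_{ij} p_i p_j + \tfrac{1}{2} \sum_i G_{ii} p_i^2 \qquad (9.14) $$

$$ \stackrel{\rm def}{=} f(x_k) + \hat g^T \hat p, \qquad (9.15) $$

where we have collected the elements of $g$ and $G$ in the $(q - 1)$-vector of unknowns

$$ \hat g \equiv \left( g^T, \{G_{ij}\}_{i<j}, \left\{ \tfrac{1}{\sqrt{2}} G_{ii} \right\} \right)^T, \qquad (9.16) $$

and where the $(q - 1)$-vector $\hat p$ is given by

$$ \hat p \equiv \left( p^T, \{p_i p_j\}_{i<j}, \left\{ \tfrac{1}{\sqrt{2}} p_i^2 \right\} \right)^T. $$

The model (9.15) has the same form as (9.11), and the determination of the vector of
unknown coefficients $\hat g$ can be done as in the linear case.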
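The rewriting of the quadratic model as a linear function of an extended vector can be checked numerically. The Python sketch below assumes the scaling $1/\sqrt{2}$ on the diagonal entries of both extended vectors, so that their product reproduces the $\tfrac{1}{2} G_{ii} p_i^2$ terms of (9.14); the function names and the sample data are hypothetical.

```python
import math
from itertools import combinations

def extend_p(p):
    """Extended step: (p, {p_i p_j for i < j}, {p_i^2 / sqrt(2)})."""
    cross = [p[i] * p[j] for i, j in combinations(range(len(p)), 2)]
    squares = [pi * pi / math.sqrt(2.0) for pi in p]
    return list(p) + cross + squares

def extend_g(g, G):
    """Extended unknowns: (g, {G_ij for i < j}, {G_ii / sqrt(2)})."""
    n = len(g)
    cross = [G[i][j] for i, j in combinations(range(n), 2)]
    diag = [G[i][i] / math.sqrt(2.0) for i in range(n)]
    return list(g) + cross + diag

def quadratic_terms(g, G, p):
    """Direct evaluation of g^T p + p^T G p / 2 for symmetric G."""
    n = len(p)
    linear = sum(g[i] * p[i] for i in range(n))
    quad = sum(p[i] * G[i][j] * p[j] for i in range(n) for j in range(n))
    return linear + 0.5 * quad

# Hypothetical data for n = 3, so that q - 1 = 9.
g = [1.0, -2.0, 0.5]
G = [[2.0, 0.3, -1.0],
     [0.3, 1.5, 0.7],
     [-1.0, 0.7, 4.0]]
p = [0.4, -1.2, 2.0]

g_hat, p_hat = extend_g(g, G), extend_p(p)
lhs = sum(a * b for a, b in zip(g_hat, p_hat))
rhs = quadratic_terms(g, G, p)
```

Because the model is linear in $\hat g$, each interpolation condition contributes one row $\hat p^T$ to a linear system, exactly as the vectors $(s^l)^T$ do in the linear case.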

Multivariate quadratic functions can be represented in various ways. The monomial
basis (9.14) has the advantage that known structure in the Hessian can be imposed easily by
setting appropriate elements in $G$ to zero. Other bases are, however, more convenient when
one is developing mechanisms for avoiding singularity of the system (9.7).

We denote by $\{\phi_i(\cdot)\}_{i=1}^q$ a basis for the linear space of $n$-dimensional quadratic
functions. The function (9.6) can therefore be expressed as

$$ m_k(x) = \sum_{i=1}^q \alpha_i \phi_i(x), $$

for some coefficients $\alpha_i$. The interpolation set $Y = \{y^1, y^2, \dots, y^q\}$ determines the
coefficients $\alpha_i$ uniquely if the determinant

$$ \delta(Y) = \det \begin{pmatrix} \phi_1(y^1) & \cdots & \phi_1(y^q) \\ \vdots & & \vdots \\ \phi_q(y^1) & \cdots & \phi_q(y^q) \end{pmatrix} \qquad (9.17) $$

is nonzero.

As model-based algorithms iterate, the determinant $\delta(Y)$ may approach zero, leading
to numerical difficulties or even failure. Several algorithms therefore contain a mechanism
for keeping the interpolation points well placed. We now describe one of those mechanisms.

UPDATING  THE  INTERPOLATION  SET 

Rather than waiting until the determinant $\delta(Y)$ becomes smaller than a threshold,
we may invoke a geometry-improving procedure whenever a trial point does not provide
sufficient decrease in $f$. The goal in this case is to replace one of the interpolation points
so that the determinant (9.17) increases in magnitude. To guide us in this exchange, we use
the following property of $\delta(Y)$, which we state in terms of Lagrange functions.

For every $y \in Y$, we define the Lagrange function $L(\cdot, y)$ to be a polynomial of
degree at most 2 such that $L(y, y) = 1$ and $L(\hat y, y) = 0$ for $\hat y \neq y$, $\hat y \in Y$. Suppose that the
set $Y$ is updated by removing a point $y_-$ and replacing it by some other point $y_+$, to give the
new set $Y^+$. One can show that (after a suitable normalization and given certain conditions
[256])

$$ |\delta(Y^+)| \le |L(y_+, y_-)| \, |\delta(Y)|. \qquad (9.18) $$

Algorithm 9.1 can make good use of this inequality to update the interpolation set.
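The relation between the determinant and the Lagrange functions is easiest to see in one variable, where three points determine a quadratic. In the Python sketch below, which uses the monomial basis $1, x, x^2$ as an assumption, the determinant is of Vandermonde type and the relation holds with equality. The function names and sample points are hypothetical.

```python
def lagrange(x, y, Y):
    """Lagrange function L(x, y): equals 1 at y and 0 at the other points of Y."""
    value = 1.0
    for other in Y:
        if other != y:
            value *= (x - other) / (y - other)
    return value

def delta(Y):
    """Determinant of the 3 x 3 matrix with rows (1, 1, 1), (a, b, c), (a^2, b^2, c^2)."""
    a, b, c = Y
    return ((b * c * c - c * b * b)
            - (a * c * c - c * a * a)
            + (a * b * b - b * a * a))

Y = [0.0, 1.0, 1.1]            # two points nearly coincide: poor geometry
y_minus, y_plus = 1.1, 3.0     # outgoing point and its replacement
Y_plus = [y_plus if y == y_minus else y for y in Y]

ratio = abs(delta(Y_plus)) / abs(delta(Y))
growth = abs(lagrange(y_plus, y_minus, Y))
```

Here `ratio` and `growth` agree (both are about 54.5), so a replacement point at which $|L(\cdot, y_-)|$ is large moves the interpolation system well away from singularity.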

Consider first the case in which the trial point $x^+$ provides sufficient reduction in the
objective function ($\rho \ge \eta$). We include $x^+$ in $Y$ and remove another point $y_-$ from $Y$.
Motivated by (9.18), we select the outgoing point as follows:

$$ y_- = \arg\max_{y \in Y} |L(x^+, y)|. $$

Next, let us consider the case in which the reduction in $f$ is not sufficient ($\rho < \eta$).
We first determine whether the set $Y$ should be improved, and for this purpose we use
the following rule. We consider $Y$ to be adequate at the current iterate $x_k$ if for all $y^i \in Y$
such that $\|x_k - y^i\| \le \Delta$ we have that $|\delta(Y)|$ cannot be doubled by replacing one of these
interpolation points $y^i$ with any point $y$ inside the trust region. If $Y$ is adequate but the
reduction in $f$ was not sufficient, we decrease the trust-region radius and begin a new
iteration.

If $Y$ is inadequate, the geometry-improving mechanism is invoked. We choose a point
$y_- \in Y$ and replace it by some other point $y_+$ that is chosen solely with the objective
of improving the determinant (9.17). For every point $y^i \in Y$, we define its potential
replacement $y_r^i$ as

$$ y_r^i = \arg\max_{\|y - x_k\| \le \Delta} |L(y, y^i)|. $$

The outgoing point $y_-$ is selected as the point for which $|L(y_r^i, y^i)|$ is maximized over all
points $y^i \in Y$.
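
The role of the Lagrange functions in the exchange rule can be sketched numerically. The following is a minimal illustration for quadratic interpolation in $\mathbb{R}^2$ with six points, not the procedure of [256] or [76]. The helper names (`quad_basis`, `lagrange_values`, `outgoing_index`) and the monomial basis are assumptions of this sketch. For the unnormalized matrix of basis values used here, replacing $y_-$ by $x^+$ scales the determinant by exactly $L(x^+, y_-)$ (a consequence of Cramer's rule), which is the mechanism behind (9.18).

```python
import numpy as np

def quad_basis(x):
    """Monomial basis for quadratics in R^2: 1, x0, x1, x0^2, x0*x1, x1^2."""
    x0, x1 = x
    return np.array([1.0, x0, x1, x0 * x0, x0 * x1, x1 * x1])

def lagrange_values(Y, x):
    """Return the vector [L(x, y) for y in Y] for a poised set Y of six points."""
    M = np.array([quad_basis(y) for y in Y])   # row j holds the basis evaluated at y_j
    return np.linalg.solve(M.T, quad_basis(x))

def outgoing_index(Y, x_plus):
    """Index of the outgoing point y_- = argmax_y |L(x_plus, y)|."""
    return int(np.argmax(np.abs(lagrange_values(Y, x_plus))))

# Hypothetical interpolation set and trial point, chosen only for illustration.
Y = [np.array(p, dtype=float)
     for p in [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1)]]
x_plus = np.array([0.5, -0.8])

ell = lagrange_values(Y, x_plus)
j = outgoing_index(Y, x_plus)

# Compare the basis-matrix determinant before and after the exchange.
M_old = np.array([quad_basis(y) for y in Y])
Y_new = list(Y)
Y_new[j] = x_plus
M_new = np.array([quad_basis(y) for y in Y_new])
```

Choosing the index with the largest $|L(x^+, y)|$ therefore gives the largest possible growth factor for the determinant among all single-point exchanges that bring in $x^+$.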

Implementing  these  rules  efficiently  in  practice  is  not  simple,  and  one  must  also 
consider  several  possible  difficulties  we  have  not  discussed;  see  [76] .  Strategies  for  improving 
the  position  of  the  interpolation  set  are  the  subject  of  ongoing  investigation  and  new 
developments  are  likely  in  the  coming  years. 


A  METHOD  BASED  ON  MINIMUM-CHANGE  UPDATING 

We now consider a method that can be viewed as an extension of the quasi-Newton
approach discussed in Chapter 6. The method uses quadratic models but requires only
$O(n^3)$ operations per iteration, substantially fewer than the $O(n^4)$ operations required by
the methods described above. To achieve this economy, the method retains only $O(n)$ points
for the interpolation conditions (9.7) and absorbs the remaining degrees of freedom in the
model (9.6) by requiring that the Hessian of the model change as little as possible from
one iteration to the next. This least-change property is one of the key ingredients in
quasi-Newton methods, the other ingredient being the requirement that the model interpolate
the gradient $\nabla f$ at the two most recent points. The method we describe now combines the
least-change property with interpolation of function values.

At the $k$th iteration of the algorithm, a new quadratic model $m_{k+1}$ of the form (9.6)
is constructed after taking a step from $x_k$ to $x_{k+1}$. The coefficients $f_{k+1}$, $g_{k+1}$, $G_{k+1}$ of the
model $m_{k+1}$ are determined as the solution of the problem

$$ \min_{f,\, g,\, G} \ \|G - G_k\|_F^2 \tag{9.19a} $$

$$ \text{subject to} \quad G \ \text{symmetric}, \qquad m(y^l) = f(y^l), \quad l = 1, 2, \ldots, q, \tag{9.19b} $$

where $\|\cdot\|_F$ denotes the Frobenius norm (see (A.9)), $G_k$ is the Hessian of the previous
model $m_k$, and $q$ is an integer comparable to $n$. One can show that the integer $q$ must be
chosen larger than $n + 1$ to guarantee that $G_{k+1}$ is not equal to $G_k$. An appropriate value in
practice is $q = 2n + 1$; for this choice the number of interpolation points is roughly twice
that used for linear models.

Problem (9.19) is an equality-constrained quadratic program whose KKT conditions
can be expressed as a system of equations. Once the model $m_{k+1}$ is determined, we compute
a new step by solving a trust-region problem of the form (9.9). In this approach, too,
it is necessary to ensure that the geometry of the interpolation set $Y$ is adequate. We
therefore impose two minimum requirements. First, the set $Y$ should be such that the
equations (9.19b) can be satisfied for any right-hand side. Second, the points $y^l$ should
not all lie in a hyperplane. If these two conditions hold, problem (9.19) has a unique
solution.
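
To make the KKT system of (9.19) concrete, here is a small sketch under stated assumptions; it is not the implementation of [260]. Write $D = G - G_k$ and $s_l = y^l - x_k$, and let the model be $c + g^T s + \frac12 s^T (G_k + D) s$. Stationarity of the Lagrangian of (9.19) then gives $D = \sum_l \lambda_l s_l s_l^T$ together with $\sum_l \lambda_l = 0$ and $\sum_l \lambda_l s_l = 0$, which leads to one square linear system in $(\lambda, c, g)$. The function name `least_change_update` and the test data are hypothetical.

```python
import numpy as np

def least_change_update(S, r):
    """Solve min ||D||_F^2 s.t. c + g.s_l + 0.5 s_l^T D s_l = r_l, with D symmetric.

    S: (q, n) array whose rows are the displacements s_l = y^l - x_k.
    r: (q,) array with r_l = f(y^l) - 0.5 s_l^T G_k s_l.
    Returns (c, g, D), where D = G_{k+1} - G_k.
    """
    q, n = S.shape
    A = 0.5 * (S @ S.T) ** 2                 # A[i, l] = 0.5 (s_i^T s_l)^2
    E = np.hstack([np.ones((q, 1)), S])      # columns for the constant and linear terms
    K = np.block([[A, E], [E.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([r, np.zeros(n + 1)])
    sol = np.linalg.solve(K, rhs)            # KKT system in (lambda, c, g)
    lam, c, g = sol[:q], sol[q], sol[q + 1:]
    D = (S.T * lam) @ S                      # D = sum_l lambda_l s_l s_l^T
    return c, g, D

# Illustrative data: n = 2, q = 2n + 1 = 5 points, previous Hessian G_k = 0, x_k = 0.
f = lambda x: 1.0 + x[0] - 2.0 * x[1] + x[0] ** 2 + 3.0 * x[1] ** 2 + x[0] * x[1]
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
G_k = np.zeros((2, 2))
r = np.array([f(s) - 0.5 * s @ G_k @ s for s in S])

c, g, D = least_change_update(S, r)
model_values = np.array([c + g @ s + 0.5 * s @ (G_k + D) @ s for s in S])
```

In this example the five points lie on the coordinate axes, so the data carry no information about the mixed term $x_1 x_2$ of $f$. The least-change solution leaves the off-diagonal entry of the Hessian at its previous value (zero) while still matching all five function values.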

A practical algorithm based on the subproblem (9.19) resembles Algorithm 9.1 in that
it contains procedures both for generating new iterates and for improving the geometry
of the set $Y$. The implementation described in [260] contains other features to ensure that
the interpolation points are well separated and that steps are not too small. A strength of
this method is that it requires only $O(n)$ interpolation points to start producing productive
steps. In practice the method often approaches a solution with fewer than $\frac12 (n + 1)(n + 2)$
function evaluations. However, since this approach has been developed only recently, there
is insufficient numerical experience to assess its full potential.

9.3 COORDINATE AND PATTERN-SEARCH METHODS

Rather than constructing a model of $f$ explicitly based on function values, coordinate search
and  pattern-search  methods  look  along  certain  specified  directions  from  the  current  iterate 
for  a  point  with  a  lower  function  value.  If  such  a  point  is  found,  they  step  to  it  and  repeat  the 
process,  possibly  modifying  the  directions  of  search  for  the  next  iteration.  If  no  satisfactory 
new  point  is  found,  the  step  length  along  the  current  search  directions  may  be  adjusted,  or 
new  search  directions  may  be  generated. 

We  describe  first  a  simple  approach  of  this  type  that  has  been  used  often  in  practice. 
We  then  consider  a  generalized  approach  that  is  potentially  more  efficient  and  has  stronger 
theoretical  properties. 

Figure 9.1 Coordinate search method makes slow progress on this function of two variables.


COORDINATE  SEARCH  METHOD 

The coordinate search method (also known as the coordinate descent method or
the alternating variables method) cycles through the $n$ coordinate directions $e_1, e_2, \ldots, e_n$,
obtaining new iterates by performing a line search along each direction in turn. Specifically,
at the first iteration, we fix all components of $x$ except the first one $x_1$ and find a new value
of this component that minimizes (or at least reduces) the objective function. On the next
iteration, we repeat the process with the second component $x_2$, and so on. After $n$ iterations,
we return to the first variable and repeat the cycle. Though simple and somewhat intuitive,
this method can be quite inefficient in practice, as we illustrate in Figure 9.1 for a quadratic
function in two variables. Note that after a few iterations, neither the vertical ($x_2$) nor the
horizontal ($x_1$) move makes much progress toward the solution at each iteration.
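
The slow progress described above is easy to reproduce. The sketch below applies cyclic coordinate search with exact line searches to a strictly convex quadratic $f(x) = \frac12 x^T A x - b^T x$ whose two variables are strongly coupled. The matrix, starting point, and function name are illustrative choices, not data from Figure 9.1.

```python
import numpy as np

def coordinate_search_quadratic(A, b, x0, sweeps):
    """Cyclic coordinate search on f(x) = 0.5 x^T A x - b^T x with exact line searches.

    One sweep performs n one-dimensional minimizations, along e_1, ..., e_n in turn.
    """
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(sweeps):
        for i in range(n):
            grad_i = A[i] @ x - b[i]      # i-th component of the gradient A x - b
            x[i] -= grad_i / A[i, i]      # exact minimizer of f along e_i
    return x

# Strongly coupled variables: the contours of f are long, thin, tilted ellipses.
A = np.array([[1.0, 0.99], [0.99, 1.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

err_5 = np.linalg.norm(coordinate_search_quadratic(A, b, [2.0, -1.0], 5) - x_star)
err_2000 = np.linalg.norm(coordinate_search_quadratic(A, b, [2.0, -1.0], 2000) - x_star)
```

For this matrix the error shrinks by a factor of only about $0.99^2$ per sweep, so five sweeps barely move the iterate toward the solution, while thousands of sweeps are needed for high accuracy.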

In general, the coordinate search method can iterate infinitely without ever approaching
a point where the gradient of the objective function vanishes, even when exact line
searches are used. (By contrast, as we showed in Section 3.2, the steepest descent method
produces a sequence of iterates $\{x_k\}$ for which $\|\nabla f_k\| \to 0$, under reasonable assumptions.)
In fact, a cyclic search along any set of linearly independent directions does not guarantee
global convergence [243]. Technically speaking, this difficulty arises because the steepest
descent search direction $-\nabla f_k$ may become more and more perpendicular to the coordinate
search direction. In such circumstances, the Zoutendijk condition (3.14) is satisfied because
$\cos\theta_k$ approaches zero rapidly, even when $\nabla f_k$ does not approach zero.

When the coordinate search method does converge to a solution, it often converges
much more slowly than the steepest descent method, and the difference between the two
approaches tends to increase with the number of variables. However, coordinate search may
still be useful because it does not require calculation of the gradient $\nabla f_k$, and the speed
of convergence can be quite acceptable if the variables are loosely coupled in the objective
function $f$.

Many  variants  of  the  coordinate  search  method  have  been  proposed,  some  of  which 
allow  a  global  convergence  property  to  be  proved.  One  simple  variant  is  a  “back-and-forth” 
approach  in  which  we  search  along  the  sequence  of  directions 


$$ e_1, e_2, \ldots, e_{n-1}, e_n, e_{n-1}, \ldots, e_2, e_1, e_2, \ldots \quad \text{(repeats)}. $$

Another  approach,  suggested  by  Figure  9.1,  is  first  to  perform  a  sequence  of  coordinate 
descent  steps  and  then  search  along  the  line  joining  the  first  and  last  points  in  the  cycle.  Several 
algorithms,  such  as  that  of  Hooke  and  Jeeves,  are  based  on  these  ideas;  see  Fletcher  [101] 
and  Gill,  Murray,  and  Wright  [130]. 

The  pattern-search  approach,  described  next,  generalizes  coordinate  search  in  that  it 
allows  the  use  of  a  richer  set  of  search  directions  at  each  iteration. 


PATTERN-SEARCH  METHODS 

We consider pattern-search methods that choose a certain set of search directions
at each iterate and evaluate $f$ at a given step length along each of these directions. These
candidate  points  form  a  “frame,”  or  “stencil,”  around  the  current  iterate.  If  a  point  with  a 
significantly  lower  function  value  is  found,  it  is  adopted  as  the  new  iterate,  and  the  center 
of  the  frame  is  shifted  to  this  new  point.  Whether  shifted  or  not,  the  frame  may  then  be 
altered  in  some  way  (the  set  of  search  directions  may  be  changed,  or  the  step  length  may 
grow  or  shrink),  and  the  process  repeats.  For  certain  methods  of  this  type  it  is  possible 
to  prove  global  convergence  results — typically,  that  there  exists  a  stationary  accumulation 
point. 

The  presence  of  noise  or  other  forms  of  inexactness  in  the  function  values  may  affect 
the  performance  of  pattern-search  algorithms  and  certainly  impacts  the  convergence  theory. 
Nonsmoothness  may  also  cause  undesirable  behavior,  as  can  be  shown  by  simple  examples, 
although  satisfactory  convergence  is  often  observed  on  nonsmooth  problems. 

To define pattern-search methods, we introduce some notation. For the current iterate
$x_k$, we define $\mathcal{D}_k$ to be the set of possible search directions and $\gamma_k$ to be the line search
parameter. The frame consists of the points $x_k + \gamma_k p_k$, for all $p_k \in \mathcal{D}_k$. When one of the
points in the frame yields a significant decrease in $f$, we take the step and may also increase
$\gamma_k$, so as to expand the frame for the next iteration. If none of the points in the frame has a
significantly better function value than $f_k$, we reduce $\gamma_k$ (contract the frame), set $x_{k+1} = x_k$,
and repeat. In either case, we may change the direction set $\mathcal{D}_k$ prior to the next iteration,
subject to certain restrictions.

A  more  precise  description  of  the  algorithm  follows. 

Algorithm 9.2 (Pattern-Search).

Given convergence tolerance $\gamma_{\rm tol}$, contraction parameter $\theta_{\max}$,
    sufficient decrease function $\rho : [0, \infty) \to \mathbb{R}$ with $\rho(t)$ an increasing
    function of $t$ and $\rho(t)/t \to 0$ as $t \downarrow 0$;
Choose initial point $x_0$, initial step length $\gamma_0 > \gamma_{\rm tol}$, initial direction set $\mathcal{D}_0$;
for $k = 1, 2, \ldots$
    if $\gamma_k \le \gamma_{\rm tol}$
        stop;
    if $f(x_k + \gamma_k p_k) < f(x_k) - \rho(\gamma_k)$ for some $p_k \in \mathcal{D}_k$
        Set $x_{k+1} \leftarrow x_k + \gamma_k p_k$ for some such $p_k$;
        Set $\gamma_{k+1} \leftarrow \phi_k \gamma_k$ for some $\phi_k \ge 1$; (* increase step length *)
    else
        Set $x_{k+1} \leftarrow x_k$;
        Set $\gamma_{k+1} \leftarrow \theta_k \gamma_k$, where $0 < \theta_k \le \theta_{\max} < 1$;
    end (if)
end (for)
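
A minimal Python sketch of Algorithm 9.2 follows. It makes several simplifying assumptions that the algorithm itself leaves open: the direction set is fixed to the coordinate directions and their negatives, $\phi_k$ and $\theta_k$ are constants, the sufficient decrease function is $\rho(t) = M t^{3/2}$, and the first acceptable frame point is taken. The function name, default parameter values, and the iteration cap `max_iter` are choices of this sketch.

```python
import numpy as np

def pattern_search(f, x0, gamma0=1.0, gamma_tol=1e-6, theta=0.5, phi=1.0,
                   M=1e-4, max_iter=10_000):
    """Pattern search with the direction set {e_1, ..., e_n, -e_1, ..., -e_n}."""
    x = np.array(x0, dtype=float)
    n = x.size
    D = np.vstack([np.eye(n), -np.eye(n)])   # rows are the search directions
    gamma = gamma0
    fx = f(x)
    for _ in range(max_iter):                # safety cap, not part of Algorithm 9.2
        if gamma <= gamma_tol:
            break
        rho = M * gamma ** 1.5               # sufficient decrease rho(gamma)
        for p in D:
            trial = x + gamma * p
            f_trial = f(trial)
            if f_trial < fx - rho:           # accept the first acceptable frame point
                x, fx = trial, f_trial
                gamma *= phi                 # phi >= 1: keep or expand the frame
                break
        else:
            gamma *= theta                   # no frame point accepted: contract
    return x, fx, gamma

f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2
x_final, f_final, gamma_final = pattern_search(f, [0.0, 0.0])
```

Only function values are used, and the method stops once the frame has contracted below `gamma_tol`, which serves as the convergence test.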

A wise choice of the direction set $\mathcal{D}_k$ is crucial to the practical behavior of this approach
and to the theoretical results that can be proved about it. A key condition is that at least one
direction in this set should give a direction of descent for $f$ whenever $\nabla f(x_k) \ne 0$ (that is,
whenever $x_k$ is not a stationary point). To make this condition specific, we refer to formula
(3.12), where we defined the angle between a possible search direction $p$ and the gradient
$\nabla f_k$ as follows:

$$ \cos\theta = \frac{-\nabla f_k^T p}{\|\nabla f_k\|\, \|p\|}. \tag{9.20} $$


Recall from Theorem 3.2 that global convergence of a line-search method to a stationary
point of $f$ could be ensured if the search direction $p$ at each iterate $x_k$ satisfied $\cos\theta \ge \delta$,
for some constant $\delta > 0$, and if the line search parameter satisfied certain conditions. In
the same spirit, we choose $\mathcal{D}_k$ so that at least one direction $p \in \mathcal{D}_k$ will yield $\cos\theta \ge \delta$,
regardless of the value of $\nabla f_k$. This condition is as follows:

$$ \kappa(\mathcal{D}_k) \stackrel{\rm def}{=} \min_{v \in \mathbb{R}^n} \max_{p \in \mathcal{D}_k} \frac{v^T p}{\|v\|\, \|p\|} \ge \delta. \tag{9.21} $$


A second condition on $\mathcal{D}_k$ is that the lengths of the vectors in this set are all roughly
similar, so that the diameter of the frame formed by this set is captured adequately by the
step length parameter $\gamma_k$. Thus, we impose the condition

$$ \beta_{\min} \le \|p\| \le \beta_{\max}, \quad \text{for all } p \in \mathcal{D}_k, \tag{9.22} $$

for some positive constants $\beta_{\min}$ and $\beta_{\max}$ and all $k$. If the conditions (9.21) and (9.22) hold,
we have for any $k$ that

$$ -\nabla f_k^T p \ge \kappa(\mathcal{D}_k)\, \|\nabla f_k\|\, \|p\| \ge \delta \beta_{\min} \|\nabla f_k\|, \quad \text{for some } p \in \mathcal{D}_k. $$

Examples of sets $\mathcal{D}_k$ that satisfy the properties (9.21) and (9.22) include the coordinate
direction set

$$ \{ e_1, e_2, \ldots, e_n, -e_1, -e_2, \ldots, -e_n \}, \tag{9.23} $$

and the set of $n + 1$ vectors defined by

$$ p_i = \frac{1}{2n} e - e_i, \quad i = 1, 2, \ldots, n; \qquad p_{n+1} = \frac{1}{2n} e, \tag{9.24} $$

where $e = (1, 1, \ldots, 1)^T$. For $n = 3$ these direction sets are sketched in Figure 9.2.

The coordinate descent method described above is similar to the special case of
Algorithm 9.2 obtained by setting $\mathcal{D}_k = \{e_i, -e_i\}$ for some $i = 1, 2, \ldots, n$ at each iteration.
Note that for this choice of $\mathcal{D}_k$, we have $\kappa(\mathcal{D}_k) = 0$ for all $k$. Hence, as noted above, $\cos\theta$
can be arbitrarily close to zero at each iteration.
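
The quantity $\kappa(\mathcal{D})$ in (9.21) can be estimated numerically in two dimensions by sampling unit vectors $v$ on a fine grid of angles. The sketch below is only an illustration: the function name `cosine_measure_2d` and the sampling scheme are assumptions, and a sampled minimum is in general an upper bound on the true value. It compares the coordinate set (9.23), the simplex set (9.24) with $n = 2$, and the single pair $\{e_1, -e_1\}$.

```python
import numpy as np

def cosine_measure_2d(D, samples=3600):
    """Sampled estimate of kappa(D) from (9.21) for directions in R^2."""
    D = np.asarray(D, dtype=float)
    P = D / np.linalg.norm(D, axis=1, keepdims=True)        # unit search directions
    angles = 2.0 * np.pi * np.arange(samples) / samples
    V = np.column_stack([np.cos(angles), np.sin(angles)])   # sampled unit vectors v
    # For each v take the best-aligned direction, then the worst case over v.
    return float(np.min(np.max(V @ P.T, axis=1)))

n = 2
e = np.ones(n)
coordinate_set = np.vstack([np.eye(n), -np.eye(n)])                   # (9.23)
simplex_set = np.vstack([e / (2 * n) - np.eye(n), e / (2 * n)])       # (9.24)
single_pair = np.array([[1.0, 0.0], [-1.0, 0.0]])                     # {e_1, -e_1}

kappa_coordinate = cosine_measure_2d(coordinate_set)
kappa_simplex = cosine_measure_2d(simplex_set)
kappa_pair = cosine_measure_2d(single_pair)
```

For the coordinate set the worst-case vector lies midway between two adjacent directions, which gives $\kappa = 1/\sqrt{2}$ in two dimensions. The pair $\{e_1, -e_1\}$ gives zero because a vector $v$ orthogonal to $e_1$ makes no acute angle with either direction.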

Often,  the  directions  that  satisfy  the  properties  (9.21)  and  (9.22)  form  only  a  subset  of 
the  direction  set  T>k,  which  may  contain  other  directions  as  well.  These  additional  directions 
could  be  chosen  heuristically,  according  to  some  knowledge  of  the  function  /  and  its 
scaling,  or  according  to  experience  on  previous  iterations.  They  could  also  be  chosen  as 
linear combinations of the core set of directions (the ones that ensure $\delta > 0$).

Note that Algorithm 9.2 does not require us to choose the point $x_k + \gamma_k p_k$, $p_k \in \mathcal{D}_k$,
with  the  smallest  objective  value.  Indeed,  we  may  save  on  function  evaluations  by  not 
evaluating $f$ at all points in the frame, but rather performing the evaluations one at a time
and  accepting  the  first  candidate  point  that  satisfies  the  sufficient  decrease  condition. 


Figure 9.2 Generating search sets in $\mathbb{R}^3$: coordinate direction set (left) and simplex
set (right).

Another important detail in the implementation of Algorithm 9.2 is the choice of
sufficient decrease function $\rho(t)$. If $\rho(\cdot)$ is chosen to be identically zero, then any candidate
point that produces a decrease in $f$ is acceptable as a new iterate. As we have seen in Chapter 3,
such a weak condition does not lead to strong global convergence results in general. A more
appropriate choice might be $\rho(t) = M t^{3/2}$, where $M$ is some positive constant.

9.4  A  CONJUGATE-DIRECTION  METHOD 

We have seen in Chapter 5 that the minimizer of a strictly convex quadratic function

$$ f(x) = \tfrac12 x^T A x - b^T x \tag{9.25} $$

can be located by performing one-dimensional minimizations along a set of $n$ conjugate
directions. These directions were defined in Chapter 5 as a linear combination of gradients.
In this section, we show how to construct conjugate directions using only function values,
and we therefore devise an algorithm for minimizing (9.25) that requires only function
value calculations. Naturally, we also consider an extension of this approach to the case of a
nonlinear objective $f$.

We use the parallel subspace property, which we describe first for the case $n = 2$.
Consider two parallel lines $l_1(\alpha) = x_1 + \alpha p$ and $l_2(\alpha) = x_2 + \alpha p$, where $x_1$, $x_2$, and $p$ are
given vectors in $\mathbb{R}^2$ and $\alpha$ is the scalar parameter that defines the lines. We show below that
if $x_1^*$ and $x_2^*$ denote the minimizers of $f(x)$ along $l_1$ and $l_2$, respectively, then $x_1^* - x_2^*$ is
conjugate to $p$. Hence, if we perform a one-dimensional minimization along the line joining
$x_1^*$ and $x_2^*$, we will reach the minimizer of $f$, because we have successively minimized along
the two conjugate directions $p$ and $x_1^* - x_2^*$. This process is illustrated in Figure 9.3.

This observation suggests the following algorithm for minimizing a two-dimensional
quadratic function $f$. We choose a set of linearly independent directions, say the coordinate
directions $e_1$ and $e_2$. From any initial point $x_0$, we first minimize $f$ along $e_2$ to obtain the
point $x_1$. We then perform successive minimizations along $e_1$ and $e_2$, starting from $x_1$, to
obtain the point $z$. It follows from the parallel subspace property that $z - x_1$ is conjugate
to $e_2$ because $x_1$ and $z$ are minimizers along two lines parallel to $e_2$. Thus, if we perform a
one-dimensional search from $x_1$ along the direction $z - x_1$, we will locate the minimizer of $f$.

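We now state the parallel subspace minimization property in its most general form.
Suppose that $x_1$, $x_2$ are two distinct points in $\mathbb{R}^n$ and that $\{p_1, p_2, \ldots, p_l\}$ is a set of linearly
independent directions in $\mathbb{R}^n$. Let us define the two parallel linear varieties

$$ S_1 = \Big\{ x_1 + \sum_{i=1}^{l} \alpha_i p_i \ \Big|\ \alpha_i \in \mathbb{R},\ i = 1, 2, \ldots, l \Big\}, $$

$$ S_2 = \Big\{ x_2 + \sum_{i=1}^{l} \alpha_i p_i \ \Big|\ \alpha_i \in \mathbb{R},\ i = 1, 2, \ldots, l \Big\}. $$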

Figure 9.3 Geometric construction of conjugate directions. (The minimizer of $f$ is denoted by $x^*$.)


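If we denote the minimizers of $f$ on $S_1$ and $S_2$ by $x_1^*$ and $x_2^*$, respectively, then $x_1^* - x_2^*$ is
conjugate to $p_1, p_2, \ldots, p_l$. It is easy to verify this claim. By the minimization property, we
have that

$$ \left. \frac{\partial f(x_1^* + \alpha_i p_i)}{\partial \alpha_i} \right|_{\alpha_i = 0} = \nabla f(x_1^*)^T p_i = 0, \qquad i = 1, 2, \ldots, l, $$

and similarly for $x_2^*$. Therefore we have from (9.25) that

$$ 0 = \big(\nabla f(x_1^*) - \nabla f(x_2^*)\big)^T p_i
     = (A x_1^* - b - A x_2^* + b)^T p_i
     = (x_1^* - x_2^*)^T A p_i, \qquad i = 1, 2, \ldots, l. \tag{9.26} $$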
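
The parallel subspace property is easy to check numerically for $n = 2$. In the sketch below, the matrix $A$, vector $b$, direction $p$, and the two starting points are arbitrary illustrative values, and `line_minimizer` is a hypothetical helper that uses the closed-form minimizer of a quadratic along a line (it uses $A$ and $b$ only to keep the check short).

```python
import numpy as np

def line_minimizer(A, b, x, p):
    """Exact minimizer of f(x) = 0.5 x^T A x - b^T x along the line x + alpha p."""
    alpha = -((A @ x - b) @ p) / (p @ A @ p)
    return x + alpha * p

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
b = np.array([1.0, -1.0])
p = np.array([1.0, 0.5])                   # common direction of the two parallel lines
x1 = np.array([0.0, 0.0])
x2 = np.array([2.0, -3.0])

x1_star = line_minimizer(A, b, x1, p)      # minimizer on l_1
x2_star = line_minimizer(A, b, x2, p)      # minimizer on l_2
d = x1_star - x2_star

conjugacy = d @ A @ p                      # should vanish, as in (9.26)
x_min = line_minimizer(A, b, x1_star, d)   # one more line search along d
```

Because $p$ and $d$ are conjugate, the final line search along $d$ lands on the minimizer of $f$, which is the solution of $Ax = b$.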


We now consider the case $n = 3$ and show how the parallel subspace property
can be used to generate a set of three conjugate directions. We choose a set of linearly
independent directions, say $e_1, e_2, e_3$. From any starting point $x_0$ we first minimize $f$ along
the last direction $e_3$ to obtain a point $x_1$. We then perform three successive one-dimensional
minimizations, starting from $x_1$, along the directions $e_1, e_2, e_3$ and denote the resulting
point by $z$. Next, we minimize $f$ along the direction $p_1 = z - x_1$ to obtain $x_2$. As noted
earlier, $p_1 = z - x_1$ is conjugate to $e_3$. We note also that $x_2$ is the minimizer of $f$ on the set
$S_1 = \{ y + \alpha_1 e_3 + \alpha_2 p_1 \mid \alpha_1 \in \mathbb{R},\ \alpha_2 \in \mathbb{R} \}$, where $y$ is the intermediate point obtained after
minimizing along $e_1$ and $e_2$.

A new iteration now commences. We discard $e_1$ and define the new set of search
directions as $e_2, e_3, p_1$. We perform one-dimensional minimizations along $e_2, e_3, p_1$, starting
from $x_2$, to obtain the point $z$. Note that $z$ can be viewed as the minimizer of $f$ on the set
$S_2 = \{ \hat{y} + \alpha_1 e_3 + \alpha_2 p_1 \mid \alpha_1 \in \mathbb{R},\ \alpha_2 \in \mathbb{R} \}$, for some intermediate point $\hat{y}$. Therefore, by
applying the parallel subspace minimization property to the sets $S_1$ and $S_2$ just defined, we
have that $p_2 = z - x_2$ is conjugate to both $e_3$ and $p_1$. We then minimize $f$ along $p_2$ to
obtain a point $x_3$, which is the minimizer of $f$. This procedure thus generates the conjugate
directions $e_3, p_1, p_2$.

We  can  now  state  the  general  algorithm,  which  consists  of  an  inner  and  an  outer 
iteration.  In  the  inner  iteration,  n  one-dimensional  minimizations  are  performed  along  a  set 
of  linearly  independent  directions.  Upon  completion  of  the  inner  iteration,  a  new  conjugate 
direction  is  generated,  which  replaces  one  of  the  previously  stored  search  directions. 

Algorithm 9.3 (DFO Method of Conjugate Directions).

Choose an initial point $x_0$ and set $p_i = e_i$, for $i = 1, 2, \ldots, n$;
Compute $x_1$ as the minimizer of $f$ along the line $x_0 + \alpha p_n$;
Set $k \leftarrow 1$.
repeat until a convergence test is satisfied
    Set $z_1 \leftarrow x_k$;
    for $j = 1, 2, \ldots, n$
        Calculate $\alpha_j$ so that $f(z_j + \alpha_j p_j)$ is minimized;
        Set $z_{j+1} \leftarrow z_j + \alpha_j p_j$;
    end (for)
    Set $p_j \leftarrow p_{j+1}$ for $j = 1, 2, \ldots, n - 1$ and $p_n \leftarrow z_{n+1} - z_1$;
    Calculate $\alpha_n$ so that $f(z_{n+1} + \alpha_n p_n)$ is minimized;
    Set $x_{k+1} \leftarrow z_{n+1} + \alpha_n p_n$;
    Set $k \leftarrow k + 1$;
end (repeat)
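
The following Python sketch mirrors Algorithm 9.3 for a quadratic objective, using only function values. It assumes exact arithmetic behaves well (no safeguards against a zero or nearly dependent direction are included), and it replaces the convergence test by a fixed number of outer iterations. The function names and the test problem are illustrative.

```python
import numpy as np

def line_min_three_point(f, z, p):
    """Minimize f along z + alpha p by fitting a quadratic to three function values.

    The fit is exact when f restricted to the line is a strictly convex quadratic.
    """
    f_minus, f_zero, f_plus = f(z - p), f(z), f(z + p)
    curvature = f_plus - 2.0 * f_zero + f_minus   # equals p^T A p for (9.25)
    slope = 0.5 * (f_plus - f_minus)              # directional derivative at alpha = 0
    return z - (slope / curvature) * p

def dfo_conjugate_directions(f, x0, outer_iters):
    """Run a fixed number of outer iterations of the conjugate-direction scheme."""
    x0 = np.array(x0, dtype=float)
    n = x0.size
    P = [np.eye(n)[i] for i in range(n)]          # p_i = e_i
    x = line_min_three_point(f, x0, P[-1])        # x_1: minimize along p_n
    for _ in range(outer_iters):
        z1 = x
        z = z1
        for p in P:                               # inner iteration: n line searches
            z = line_min_three_point(f, z, p)
        P = P[1:] + [z - z1]                      # drop p_1, append z_{n+1} - z_1
        x = line_min_three_point(f, z, P[-1])     # minimize along the new direction
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x - b @ x

x_final = dfo_conjugate_directions(f, np.zeros(3), outer_iters=2)   # n - 1 = 2
```

Each line search costs three function evaluations (fewer if the value at the current point is reused), so the $n - 1$ outer iterations use a number of evaluations that grows like $n^2$.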

The line searches can be performed by quadratic interpolation using three function
values along each search direction. Since the restriction of (9.25) to a line is a (strictly
convex) quadratic, the interpolating quadratic matches it exactly, and the one-dimensional
minimizer can easily be computed. Note that at the end of (the outer) iteration $k$, the
directions $p_{n-k}, p_{n-k+1}, \ldots, p_n$ are conjugate by the property mentioned above. Thus the
algorithm terminates at the minimizer of (9.25) after $n - 1$ iterations, provided none of the
conjugate directions is zero. Unfortunately, this possibility cannot be ruled out, and some
safeguards described below must be incorporated to improve robustness. In the (usual)
case that Algorithm 9.3 terminates after $n - 1$ iterations, it will perform $O(n^2)$ function
evaluations.

Algorithm 9.3 can be extended to minimize nonquadratic objective functions. The
only change is in the line search, which must be performed approximately, using interpolation.
Because of the possible nonconvexity, this one-dimensional search must be done with
care; see Brent [39] for a treatment of this subject. Numerical experience indicates that this
extension of Algorithm 9.3 performs adequately for small-dimensional problems but that
sometimes the directions $\{p_i\}$ tend to become linearly dependent. Several modifications of
the algorithm have been proposed to guard against this possibility. One such modification
measures the degree to which the directions $\{p_i\}$ are conjugate. To do so, we define the
scaled directions

$$ \hat{p}_i = \frac{p_i}{\sqrt{p_i^T A p_i}}, \qquad i = 1, 2, \ldots, n. \tag{9.27} $$

One can show [239] that the quantity

$$ |\det(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_n)| \tag{9.28} $$

is maximized if and only if the vectors $p_i$ are conjugate with respect to $A$. This result suggests
that we should not replace one of the existing search directions in the set $\{p_1, p_2, \ldots, p_n\}$
by the most recently generated conjugate direction if this action causes the quantity (9.28)
to decrease.

Procedure 9.4 implements this strategy for the case of the quadratic objective function
(9.25). Some algebraic manipulations (which we do not present here) show that we can
compute the scaled directions $\hat{p}_i$ without using the Hessian $A$ because the terms $p_i^T A p_i$
are available from the line search along $p_i$. Further, only comparisons using computed
function values are needed to ensure that (9.28) does not decrease. The following
procedure is invoked immediately after the execution of the inner iteration (or for-loop) of
Algorithm 9.3.

Procedure 9.4 (Updating of the Set of Directions).

Find the integer $m \in \{1, 2, \ldots, n\}$ such that $\psi_m = f(x_{m-1}) - f(x_m)$
    is maximized;
Let $f_1 = f(z_1)$, $f_2 = f(z_{n+1})$, and $f_3 = f(2 z_{n+1} - z_1)$;
if $f_3 \ge f_1$ or $(f_1 - 2 f_2 + f_3)(f_1 - f_2 - \psi_m)^2 \ge \tfrac12 \psi_m (f_1 - f_3)^2$
    Keep the set $p_1, p_2, \ldots, p_n$ unchanged and set $x_{k+1} \leftarrow z_{n+1}$;
else
    Set $\hat{p} \leftarrow z_{n+1} - z_1$ and calculate $\hat{\alpha}$ so that $f(z_{n+1} + \hat{\alpha} \hat{p})$ is minimized;
    Set $x_{k+1} \leftarrow z_{n+1} + \hat{\alpha} \hat{p}$;
    Remove $p_m$ from the set of directions and add $\hat{p}$ to this set;
end (if)


This procedure can be applied to general objective functions by implementing inexact
one-dimensional line searches. The resulting conjugate-direction method has been found to
be useful for solving small-dimensional problems.

9.5 NELDER-MEAD METHOD


The Nelder-Mead simplex-reflection method has been a popular DFO method since its
introduction in 1965 [223]. It takes its name from the fact that at any stage of the algorithm,
we keep track of $n + 1$ points of interest in $\mathbb{R}^n$, whose convex hull forms a simplex. (The
method has nothing to do with the simplex method for linear programming discussed
in Chapter 13.) Given a simplex $S$ with vertices $\{z_1, z_2, \ldots, z_{n+1}\}$, we can define an
associated matrix $V(S)$ by taking the $n$ edges of $S$ that emanate from one of its vertices
($z_1$, say), as follows:

$$ V(S) = [\, z_2 - z_1,\ z_3 - z_1,\ \ldots,\ z_{n+1} - z_1 \,]. $$

The simplex is said to be nondegenerate or nonsingular if $V$ is a nonsingular matrix.
(For example, a simplex in $\mathbb{R}^3$ is nondegenerate if its four vertices are not
coplanar.)

In  a  single  iteration  of  the  Nelder-Mead  algorithm,  we  seek  to  remove  the  vertex  with 
the  worst  function  value  and  replace  it  with  another  point  with  a  better  value.  The  new 
point  is  obtained  by  reflecting,  expanding,  or  contracting  the  simplex  along  the  line  joining 
the  worst  vertex  with  the  centroid  of  the  remaining  vertices.  If  we  cannot  find  a  better  point 
in this manner, we retain only the vertex with the best function value, and we shrink the
simplex by moving all other vertices toward this vertex.

We specify a single step of the algorithm after defining some notation. The
$n + 1$ vertices of the current simplex are denoted by $\{x_1, x_2, \ldots, x_{n+1}\}$, where we choose the
ordering so that

$$ f(x_1) \le f(x_2) \le \cdots \le f(x_{n+1}). $$

The centroid of the best $n$ points is denoted by

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. $$

Points along the line joining $\bar{x}$ and the “worst” vertex $x_{n+1}$ are denoted by

$$ \bar{x}(t) = \bar{x} + t (x_{n+1} - \bar{x}). $$

Procedure 9.5 (One Step of Nelder-Mead Simplex).

Compute the reflection point $\bar{x}(-1)$ and evaluate $f_{-1} = f(\bar{x}(-1))$;
if $f(x_1) \le f_{-1} < f(x_n)$
    (* reflected point is neither best nor worst in the new simplex *)
    replace $x_{n+1}$ by $\bar{x}(-1)$ and go to next iteration;
else if $f_{-1} < f(x_1)$
    (* reflected point is better than the current best; try to
       go farther along this direction *)
    Compute the expansion point $\bar{x}(-2)$ and evaluate $f_{-2} = f(\bar{x}(-2))$;
    if $f_{-2} < f_{-1}$
        replace $x_{n+1}$ by $\bar{x}(-2)$ and go to next iteration;
    else
        replace $x_{n+1}$ by $\bar{x}(-1)$ and go to next iteration;
else if $f_{-1} \ge f(x_n)$
    (* reflected point is still worse than $x_n$; contract *)
    if $f(x_n) \le f_{-1} < f(x_{n+1})$
        (* try to perform “outside” contraction *)
        evaluate $f_{-1/2} = f(\bar{x}(-1/2))$;
        if $f_{-1/2} \le f_{-1}$
            replace $x_{n+1}$ by $\bar{x}(-1/2)$ and go to next iteration;
    else
        (* try to perform “inside” contraction *)
        evaluate $f_{1/2} = f(\bar{x}(1/2))$;
        if $f_{1/2} < f(x_{n+1})$
            replace $x_{n+1}$ by $\bar{x}(1/2)$ and go to next iteration;
    (* neither outside nor inside contraction was acceptable;
       shrink the simplex toward $x_1$ *)
    replace $x_i \leftarrow (1/2)(x_1 + x_i)$ for $i = 2, 3, \ldots, n + 1$;
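
Procedure 9.5 translates almost line by line into code. The sketch below is one possible rendering, with the standard values $t = -1, -2, -\frac12, \frac12$; the function name, the array layout (one vertex per row), and the test function are assumptions of this example. Function values are recomputed rather than cached, which keeps the code short at the cost of extra evaluations.

```python
import numpy as np

def nelder_mead_step(f, X):
    """One step of Procedure 9.5. X is an (n+1, n) float array of simplex vertices."""
    X = X[np.argsort([f(x) for x in X], kind="stable")]   # f(x_1) <= ... <= f(x_{n+1})
    fvals = np.array([f(x) for x in X])
    f_best, f_second_worst, f_worst = fvals[0], fvals[-2], fvals[-1]
    centroid = X[:-1].mean(axis=0)                         # centroid of the best n points

    def point(t):
        return centroid + t * (X[-1] - centroid)           # x_bar(t)

    x_r = point(-1.0)
    f_r = f(x_r)
    if f_best <= f_r < f_second_worst:                     # plain reflection
        X[-1] = x_r
        return X
    if f_r < f_best:                                       # try expansion
        x_e = point(-2.0)
        X[-1] = x_e if f(x_e) < f_r else x_r
        return X
    # From here on f_r >= f(x_n): contract.
    if f_r < f_worst:                                      # outside contraction
        x_c = point(-0.5)
        if f(x_c) <= f_r:
            X[-1] = x_c
            return X
    else:                                                  # inside contraction
        x_c = point(0.5)
        if f(x_c) < f_worst:
            X[-1] = x_c
            return X
    X[1:] = 0.5 * (X[0] + X[1:])                           # shrink toward x_1
    return X

f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2  # convex test function
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
averages = [np.mean([f(x) for x in X])]
for _ in range(200):
    X = nelder_mead_step(f, X)
    averages.append(np.mean([f(x) for x in X]))
best_value = min(f(x) for x in X)
```

The list `averages` records the mean function value over the vertices after each step, which for this convex test function should never increase.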


Procedure 9.5 is illustrated on a three-dimensional example in Figure 9.4. The worst
current vertex is $x_3$, and the possible replacement points are $\bar{x}(-1)$, $\bar{x}(-2)$, $\bar{x}(-\frac12)$, $\bar{x}(\frac12)$. If
none of the replacement points proves to be satisfactory, the simplex is shrunk to the smaller
triangle indicated by the dotted line, which retains the best vertex $x_1$. The scalars $t$ used
in defining the candidate points $\bar{x}(t)$ have been assigned the specific (and standard) values
$-1$, $-2$, $-\frac12$, and $\frac12$ in our description above. Different choices are also possible, subject to
certain restrictions.

Practical performance of the Nelder-Mead algorithm is often reasonable, though
stagnation has been observed to occur at nonoptimal points. Restarting can be used when
stagnation is detected; see Kelley [178]. Note that unless the final shrinkage step is performed,
the average function value

$$ \frac{1}{n+1} \sum_{i=1}^{n+1} f(x_i) \tag{9.29} $$

will decrease at each step. When $f$ is convex, even the shrinkage step is guaranteed not to
increase the average function value.

Figure 9.4 One step of the Nelder-Mead simplex method in $\mathbb{R}^3$, showing current
simplex (solid triangle with vertices $x_1$, $x_2$, $x_3$), reflection point $\bar{x}(-1)$, expansion
point $\bar{x}(-2)$, inside contraction point $\bar{x}(\frac12)$, outside contraction point $\bar{x}(-\frac12)$, and
shrunken simplex (dotted triangle).


A  limited  amount  of  convergence  theory  has  been  developed  for  the  Nelder-Mead 
method  in  recent  years;  see,  for  example,  Kelley  [179]  and  Lagarias  et  al.  [186]. 


9.6  IMPLICIT  FILTERING 


We now describe an algorithm designed for functions whose evaluations are modeled by
(9.2), where $h$ is smooth. This implicit filtering approach is, in its simplest form, a variant of
the steepest descent algorithm with line search discussed in Chapter 3, in which the gradient
$\nabla f_k$ is replaced by a finite difference estimate such as (9.3), with a difference parameter $\epsilon$
that may not be particularly small.

Implicit filtering works best on functions for which the noise level decreases as the
iterates approach a solution. This situation may occur when we have control over the noise
level, as is the case when $f$ is obtained by solving a differential equation to a user-specified
tolerance, or by running a stochastic simulation for a user-specified number of trials (where
an increase in the number of trials usually produces a decrease in the noise). The implicit
filtering algorithm decreases $\epsilon$ systematically (but, one hopes, not as rapidly as the decay in
error) so as to maintain reasonable accuracy in $\nabla_\epsilon f(x)$, given the noise level at the current
value of $x$. For each value of $\epsilon$, it performs an inner loop that is simply an Armijo line search
using the search direction $-\nabla_\epsilon f(x)$. If the inner loop is unable to find a satisfactory step
length after backtracking at least $a_{\max}$ times, we return to the outer loop, choose a smaller
value of $\epsilon$, and repeat. A formal specification follows.
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Algorithm  9.6  (Implicit  Filtering). 

Choose  a  sequence  {^a }  4  0,  Armijo  parameters  c  and  p  in  (0,  1), 

maximum  backtracking  parameter  amax; 

Set  k  <-  1,  Choose  initial  point  x  —  xq; 

repeat 

increment_k  <—  false; 

repeat 

Compute  f(x )  and  Vet/(x); 

if  ||Ve*/(*)||  <  et 

increment_k  •<—  true; 

else 

Find  the  smallest  integer  m  between  0  and  umax  such  that 
f  (x  -  p"'V<J{x))  <  f(x)  -  cpm  ||Vft/U)||J; 

if  no  such  m  exists 

increment_k  •<—  true; 

else 

x  x  —  p"'Ve/(x); 

until  increment _k; 

Xk  <-  x;k  <-  k+  l; 

until  a  termination  test  is  satisfied. 

Note that the inner loop in Algorithm 9.6 is essentially the backtracking line search
algorithm (Algorithm 3.1 of Chapter 3) with a convergence criterion added to detect
whether the minimum appears to have been found to within the accuracy implied by the
difference parameter $\epsilon_k$. If the gradient estimate $\nabla_{\epsilon_k} f$ is small, or if the line search fails to find a
satisfactory new iterate (indicating that the gradient approximation $\nabla_{\epsilon_k} f(x)$ is insufficiently
accurate to produce descent in $f$), we decrease the difference parameter to $\epsilon_{k+1}$ and proceed.
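
The loop structure of Algorithm 9.6 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the gradient estimate is taken to be a central-difference formula in the spirit of (9.3), the termination test is simply "the supplied sequence of difference parameters is exhausted," and the names `central_diff_grad`, `implicit_filtering`, and the safeguard `max_inner` are hypothetical and do not come from the text.

```python
import math

def central_diff_grad(f, x, eps):
    """Central-difference gradient estimate with difference parameter eps."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2.0 * eps))
    return g

def implicit_filtering(f, x0, eps_seq, c=1e-4, rho=0.5, a_max=10, max_inner=200):
    """Sketch of Algorithm 9.6; eps_seq plays the role of the sequence {eps_k}."""
    x = list(x0)
    for eps in eps_seq:                      # outer loop: one pass per eps_k
        for _ in range(max_inner):           # inner loop (max_inner is an added safeguard)
            fx = f(x)
            g = central_diff_grad(f, x, eps)
            gnorm2 = sum(gi * gi for gi in g)
            if math.sqrt(gnorm2) <= eps:
                break                        # gradient estimate is small: move to next eps_k
            for m in range(a_max + 1):       # Armijo backtracking along -g
                step = rho ** m
                trial = [xi - step * gi for xi, gi in zip(x, g)]
                if f(trial) <= fx - c * step * gnorm2:
                    x = trial
                    break
            else:
                break                        # no acceptable m: move to next eps_k
    return x
```

A call such as `implicit_filtering(f, [0.0, 0.0], [0.5 * 0.5**k for k in range(12)])` uses a geometrically decreasing sequence of difference parameters; that particular sequence is an arbitrary choice made for illustration.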

A  basic  convergence  result  for  Algorithm  9.6  is  the  following. 

Theorem  9.2. 

Suppose that $\nabla^2 h$ is Lipschitz continuous, that Algorithm 9.6 generates an infinite
sequence of iterates $\{x_k\}$, and that

$$\lim_{k \to \infty} \; \epsilon_k^2 + \frac{\eta(x_k; \epsilon_k)}{\epsilon_k} = 0.$$

Suppose, too, that all but a finite number of inner loops in Algorithm 9.6 terminate with
$\|\nabla_{\epsilon_k} f(x_k)\| \le \epsilon_k$. Then all limit points of the sequence $\{x_k\}$ are stationary.

Proof. Using $\{\epsilon_k\} \downarrow 0$, we have under our assumptions on inner loop termination that
$\nabla_{\epsilon_k} f(x_k) \to 0$. By invoking the error bound (9.5) and noting that the right-hand side of
this expression is approaching zero, we conclude that $\nabla h(x_k) \to 0$. Hence all limit points
satisfy $\nabla h(x) = 0$, as claimed. □

More sophisticated versions of implicit filtering methods can be derived by using
the gradient estimate $\nabla_{\epsilon_k} f$ to construct quasi-Newton approximate Hessians, and thus
generating quasi-Newton search directions instead of the negative-approximate-gradient
search direction used in Algorithm 9.6.

NOTES  AND  REFERENCES 

A classical reference on derivative-free methods is Brent [39], which focuses primarily
on one-dimensional problems and includes discussion of roundoff errors and global
minimization. Recent surveys on derivative-free methods include Wright [314], Powell [256],
Conn, Scheinberg, and Toint [76], and Kolda, Lewis, and Torczon [183].

The first model-based method for derivative-free optimization was proposed by
Winfield [307]. It uses quadratic models, which are determined by the interpolation conditions
(9.7), and computes steps by solving a subproblem of the form (9.9). Practical procedures
for improving the geometry of the interpolation points were first developed by Powell in
the context of model-based methods using linear and quadratic polynomials; see [256] for
a review of this work.

Conn, Scheinberg, and Toint [75] propose and analyze model-based methods and
study the use of Newton fundamental polynomials. Methods that combine minimum change
updating and interpolation are discussed by Powell [258, 260]. Our presentation of
model-based methods in Section 9.2 is based on [76, 259, 258].

For a comprehensive discussion of pattern-search methods of the type discussed here,
we refer the reader to the review paper of Kolda, Lewis, and Torczon [183], and the references
therein.

The method of conjugate directions given in Algorithm 9.3 was proposed by
Powell [239]. For a discussion on the rate of convergence of the coordinate descent method and
for more references about this method, see Luenberger [195]. For further information on
implicit filtering, see Kelley [179] and Choi and Kelley [60] and the references therein.

Software packages that implement model-based methods include COBYLA [258], DFO
[75], UOBYQA [257], WEDGE [200], and NEWUOA [260]. The earliest code is COBYLA, which
employs linear models. DFO, UOBYQA, and WEDGE use quadratic models, whereas the method
based on minimum change updating (9.19) is implemented in NEWUOA. A pattern-search
method is implemented in APPS [171], while DIRECT [173] is designed to find a global solution.


EXERCISES

9.1 Prove Lemma 9.1.

9.2

(a) Verify that the number of interpolation conditions needed to uniquely determine the
coefficients in (9.6) is $q = \tfrac12 (n+1)(n+2)$.

(b) Verify that the number of vertices and midpoints of the edges of a nondegenerate
simplex in $\mathbb{R}^n$ add up to $q = \tfrac12 (n+1)(n+2)$ and can therefore be used as the initial
interpolation set in a DFO algorithm.

(c) How many interpolation conditions would be required to determine the coefficients
in (9.6) if the matrix $G$ were identically 0? How many if $G$ were diagonal? How many
if $G$ were tridiagonal?

9.3 Describe conditions on the vectors $s^l$ that guarantee that the model (9.14) is
uniquely determined.

9.4  Consider  the  determination  of  a  quadratic  function  in  two  variables. 

(a)  Show  that  six  points  on  a  line  do  not  determine  the  quadratic. 

(b)  Show  that  six  points  in  a  circle  in  the  plane  do  not  uniquely  determine  the  quadratic. 

9.5 Use induction to show that at the end of the outer iteration $k$ of Algorithm 9.3, the
directions $p_{n-k}, p_{n-k+1}, \dots, p_n$ are conjugate. Use this fact to show that if the step lengths
$\alpha_i$ in Algorithm 9.3 are never zero, the iteration terminates at the minimizer of (9.25) after
at most $n$ outer iterations.

9.6 Write a program that computes the one-dimensional minimizer of a strictly
convex quadratic function $f$ along a direction $p$ using quadratic interpolation. Describe
the formulas used in your program.

9.7 Find the quadratic function

$$m(x_1, x_2) = f + g_1 x_1 + g_2 x_2 + \tfrac12 G_{11} x_1^2 + G_{12} x_1 x_2 + \tfrac12 G_{22} x_2^2$$

that interpolates the following data: $x_0 = y^1 = (0, 0)^T$, $y^2 = (1, 0)^T$, $y^3 = (2, 0)^T$,
$y^4 = (1, 1)^T$, $y^5 = (0, 2)^T$, $y^6 = (0, 1)^T$, and $f(y^1) = 1$, $f(y^2) = 2.0084$, $f(y^3) = 7.0091$,
$f(y^4) = 1.0168$, $f(y^5) = -0.9909$, and $f(y^6) = -0.9916$.

9.8 Find the value of $\delta$ for which the coordinate generating set (9.23) satisfies the
property (9.21).

9.9 Show that $\kappa(\mathcal{D}_k) = 0$, where $\kappa(\cdot)$ is defined by (9.21) and $\mathcal{D}_k = \{e_i, -e_i\}$ for any
$i = 1, 2, \dots, n$.

9.10 (Hard) Prove that the generating set (9.24) satisfies the property (9.21) for a
certain value $\delta > 0$, and find this value of $\delta$.

9.11 Justify the statement that the average function value at the Nelder-Mead simplex
points will decrease over one step if any of the points $\bar{x}(-1)$, $\bar{x}(-2)$, $\bar{x}(-\tfrac12)$, $\bar{x}(\tfrac12)$ are adopted
as a replacement for $x_{n+1}$.

9.12 Show that if $f$ is a convex function, the shrinkage step in the Nelder-Mead
simplex method will not increase the average value of the function over the simplex vertices
defined by (9.29). Show that unless $f(x_1) = f(x_2) = \cdots = f(x_{n+1})$, the average value will
in fact decrease.

9.13 Suppose for the $f$ defined in (9.2), we define the approximate gradient $\nabla_\epsilon f(x)$
by the forward-difference formula

$$\nabla_\epsilon f(x) = \left[ \frac{f(x + \epsilon e_i) - f(x)}{\epsilon} \right]_{i=1,2,\dots,n}$$

rather than the central-difference formula (9.3). (This formula requires only half as many
function evaluations but is less accurate.) For this definition, prove the following variant
of Lemma 9.1: Suppose that $\nabla h(x)$ is Lipschitz continuous in a neighborhood of the box
$\{z \mid z \ge x, \ \|z - x\|_\infty \le \epsilon\}$ with Lipschitz constant $L_h$. Then we have

$$\|\nabla_\epsilon f(x) - \nabla h(x)\|_\infty \le L_h \epsilon + \frac{\eta(x; \epsilon)}{\epsilon},$$

where $\eta(x; \epsilon)$ is redefined as follows:

$$\eta(x; \epsilon) = \sup_{z \ge x, \ \|z - x\|_\infty \le \epsilon} |\phi(z)|.$$


CHAPTER 10

Least-Squares Problems


In least-squares problems, the objective function $f$ has the following special form:

$$f(x) = \frac12 \sum_{j=1}^m r_j^2(x), \tag{10.1}$$

where each $r_j$ is a smooth function from $\mathbb{R}^n$ to $\mathbb{R}$. We refer to each $r_j$ as a residual, and we
assume throughout this chapter that $m \ge n$.

Least-squares problems arise in many areas of applications, and may in fact be the
largest source of unconstrained optimization problems. Many who formulate a parametrized
model for a chemical, physical, financial, or economic application use a function of the
form (10.1) to measure the discrepancy between the model and the observed behavior of
the system (see Example 2.1, for instance). By minimizing this function, they select values
for the parameters that best match the model to the data. In this chapter we show how to
devise efficient, robust minimization algorithms by exploiting the special structure of the
function $f$ and its derivatives.

To see why the special form of $f$ often makes least-squares problems easier to solve than
general unconstrained minimization problems, we first assemble the individual components
$r_j$ from (10.1) into a residual vector $r : \mathbb{R}^n \to \mathbb{R}^m$, as follows:

$$r(x) = (r_1(x), r_2(x), \dots, r_m(x))^T. \tag{10.2}$$

Using this notation, we can rewrite $f$ as $f(x) = \tfrac12 \|r(x)\|_2^2$. The derivatives of $f(x)$ can be
expressed in terms of the Jacobian $J(x)$, which is the $m \times n$ matrix of first partial derivatives
of the residuals, defined by

$$J(x) = \left[ \frac{\partial r_j}{\partial x_i} \right]_{\substack{j = 1, 2, \dots, m \\ i = 1, 2, \dots, n}} = \begin{bmatrix} \nabla r_1(x)^T \\ \nabla r_2(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix}, \tag{10.3}$$

where each $\nabla r_j(x)$, $j = 1, 2, \dots, m$, is the gradient of $r_j$. The gradient and Hessian of $f$
can then be expressed as follows:

$$\nabla f(x) = \sum_{j=1}^m r_j(x) \nabla r_j(x) = J(x)^T r(x), \tag{10.4}$$

$$\nabla^2 f(x) = \sum_{j=1}^m \nabla r_j(x) \nabla r_j(x)^T + \sum_{j=1}^m r_j(x) \nabla^2 r_j(x) = J(x)^T J(x) + \sum_{j=1}^m r_j(x) \nabla^2 r_j(x). \tag{10.5}$$


In many applications, the first partial derivatives of the residuals and hence the
Jacobian matrix $J(x)$ are relatively easy or inexpensive to calculate. We can thus obtain
the gradient $\nabla f(x)$ as written in formula (10.4). Using $J(x)$, we also can calculate the
first term $J(x)^T J(x)$ in the Hessian $\nabla^2 f(x)$ without evaluating any second derivatives of
the functions $r_j$. This availability of part of $\nabla^2 f(x)$ "for free" is the distinctive feature of
least-squares problems. Moreover, this term $J(x)^T J(x)$ is often more important than the
second summation term in (10.5), either because the residuals $r_j$ are close to affine near
the solution (that is, the $\nabla^2 r_j(x)$ are relatively small) or because of small residuals (that
is, the $r_j(x)$ are relatively small). Most algorithms for nonlinear least-squares exploit these
structural properties of the Hessian.

The  most  popular  algorithms  for  minimizing  (10.1)  fit  into  the  line  search  and 
trust-region  frameworks  described  in  earlier  chapters.  They  are  based  on  the  Newton  and 
quasi-Newton  approaches  described  earlier,  with  modifications  that  exploit  the  particular 
structure  of  /. 

Section 10.1 contains some background on applications. Section 10.2 discusses
linear least-squares problems, which provide important motivation for algorithms for the
nonlinear problem. Section 10.3 describes the major algorithms, while Section 10.4 briefly
describes a variant of least squares known as orthogonal distance regression.

Throughout this chapter, we use the notation $\|\cdot\|$ to denote the Euclidean norm $\|\cdot\|_2$,
unless a subscript indicates that some other norm is intended.


10.1  BACKGROUND 


We discuss a simple parametrized model and show how least-squares techniques can be
used to choose the parameters that best fit the model to the observed data.


EXAMPLE 10.1

We would like to study the effect of a certain medication on a patient. We draw blood
samples at certain times after the patient takes a dose, and measure the concentration of the
medication in each sample, tabulating the time $t_j$ and concentration $y_j$ for each sample.

Based on our previous experience in such experiments, we find that the following
function $\phi(x; t)$ provides a good prediction of the concentration at time $t$, for appropriate
values of the five-dimensional parameter vector $x = (x_1, x_2, x_3, x_4, x_5)$:

$$\phi(x; t) = x_1 + t x_2 + t^2 x_3 + x_4 e^{-x_5 t}. \tag{10.6}$$

We choose the parameter vector $x$ so that this model best agrees with our observation, in
some sense. A good way to measure the difference between the predicted model values and
the observations is the following least-squares function:

$$\frac12 \sum_{j=1}^m [\phi(x; t_j) - y_j]^2, \tag{10.7}$$

which sums the squares of the discrepancies between predictions and observations at each
$t_j$. This function has precisely the form (10.1) if we define

$$r_j(x) = \phi(x; t_j) - y_j. \tag{10.8}$$

Figure 10.1 Model (10.7) (smooth curve) and the observed measurements, with
deviations indicated by vertical dotted lines.


Graphically, each term in (10.7) represents the square of the vertical distance between
the curve $\phi(x; t)$ (plotted as a function of $t$) and the point $(t_j, y_j)$, for a fixed choice
of parameter vector $x$; see Figure 10.1. The minimizer $x^*$ of the least-squares problem is
the parameter vector for which the sum of squares of the lengths of the dotted lines in
Figure 10.1 is minimized. Having obtained $x^*$, we use $\phi(x^*; t)$ to estimate the concentration
of medication remaining in the patient's bloodstream at any time $t$.
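
To make the pieces of this example concrete, the following sketch codes the model (10.6), the residuals (10.8), the objective (10.7), and the gradient formula (10.4) in plain Python. The function names are hypothetical, and the Jacobian rows are simply the partial derivatives of (10.6) with respect to $x_1, \dots, x_5$; any data passed in (`ts`, `ys`) would have to be supplied by the user, since the text gives no measurements.

```python
import math

def phi(x, t):
    """Model (10.6): x1 + t*x2 + t^2*x3 + x4*exp(-x5*t)."""
    x1, x2, x3, x4, x5 = x
    return x1 + t * x2 + t * t * x3 + x4 * math.exp(-x5 * t)

def residuals(x, ts, ys):
    """Residuals r_j(x) = phi(x; t_j) - y_j, as in (10.8)."""
    return [phi(x, t) - y for t, y in zip(ts, ys)]

def jacobian(x, ts):
    """m-by-5 Jacobian; row j holds the partial derivatives of r_j."""
    x4, x5 = x[3], x[4]
    rows = []
    for t in ts:
        e = math.exp(-x5 * t)
        rows.append([1.0, t, t * t, e, -x4 * t * e])
    return rows

def objective(x, ts, ys):
    """Least-squares objective (10.7): half the sum of squared residuals."""
    return 0.5 * sum(r * r for r in residuals(x, ts, ys))

def gradient(x, ts, ys):
    """Gradient J(x)^T r(x), formula (10.4)."""
    r = residuals(x, ts, ys)
    J = jacobian(x, ts)
    return [sum(J[j][i] * r[j] for j in range(len(r))) for i in range(5)]
```

Note that only the last two columns of the Jacobian depend on $x$; the model is linear in $x_1$, $x_2$, $x_3$, and, for fixed $x_5$, in $x_4$ as well.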


This model is an example of what statisticians call a fixed-regressor model. It assumes
that the times $t_j$ at which the blood samples are drawn are known to high accuracy, while
the observations $y_j$ may contain more or less random errors due to the limitations of the
equipment (or the lab technician!).

In general data-fitting problems of the type just described, the ordinate $t$ in the model
$\phi(x; t)$ could be a vector instead of a scalar. (In the example above, for instance, $t$ could
have two dimensions, with the first dimension representing the time since the drug was
administered and the second dimension representing the weight of the patient. We could then
use observations for an entire population of patients, not just a single patient, to obtain the
"best" parameters for this model.)

The sum-of-squares function (10.7) is not the only way of measuring the discrepancy
between the model and the observations. Other common measures include the maximum
absolute value

$$\max_{j = 1, 2, \dots, m} |\phi(x; t_j) - y_j| \tag{10.9}$$

and the sum of absolute values

$$\sum_{j=1}^m |\phi(x; t_j) - y_j|. \tag{10.10}$$

By using the definitions of the $\ell_\infty$ and $\ell_1$ norms, we can rewrite these two measures as

$$f(x) = \|r(x)\|_\infty, \qquad f(x) = \|r(x)\|_1, \tag{10.11}$$

respectively. As we discuss in Chapter 17, the problem of minimizing the functions (10.11)
can be reformulated as a smooth constrained optimization problem.

In this chapter we focus only on the $\ell_2$-norm formulation (10.1). In some situations,
there are statistical motivations for choosing the least-squares criterion. Changing the
notation slightly, let the discrepancies between model and observation be denoted by $\epsilon_j$, that
is,

$$\epsilon_j = \phi(x; t_j) - y_j.$$

It often is reasonable to assume that the $\epsilon_j$'s are independent and identically distributed
with a certain variance $\sigma^2$ and probability density function $g_\sigma(\cdot)$. (This assumption will
often be true, for instance, when the model accurately reflects the actual process, and when
the errors made in obtaining the measurements $y_j$ do not contain a systematic bias.) Under
this assumption, the likelihood of a particular set of observations $y_j$, $j = 1, 2, \dots, m$, given
that the actual parameter vector is $x$, is given by the function

$$p(y; x, \sigma) = \prod_{j=1}^m g_\sigma(\epsilon_j) = \prod_{j=1}^m g_\sigma(\phi(x; t_j) - y_j). \tag{10.12}$$

Given the observations $y_1, y_2, \dots, y_m$, the "most likely" value of $x$ is obtained by maximizing
$p(y; x, \sigma)$ with respect to $x$. The resulting value of $x$ is called the maximum likelihood
estimate.

When we assume that the discrepancies follow a normal distribution, we have

$$g_\sigma(\epsilon) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{\epsilon^2}{2 \sigma^2} \right).$$

Substitution in (10.12) yields

$$p(y; x, \sigma) = (2 \pi \sigma^2)^{-m/2} \exp\left( -\frac{1}{2 \sigma^2} \sum_{j=1}^m [\phi(x; t_j) - y_j]^2 \right).$$

For any fixed value of the variance $\sigma^2$, it is obvious that $p$ is maximized when the sum
of squares (10.7) is minimized. To summarize: When the discrepancies are assumed to be
independent and identically distributed with a normal distribution function, the maximum
likelihood estimate is obtained by minimizing the sum of squares.
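
The step from the likelihood to the sum of squares can be spelled out by taking logarithms. The short derivation below is a sketch under the normality assumption just stated; it adds nothing beyond the monotonicity of the logarithm.

```latex
\log p(y; x, \sigma)
  = -\frac{m}{2} \log(2 \pi \sigma^2)
    - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{j=1}^m [\phi(x; t_j) - y_j]^2 .
% The first term does not depend on x, and 1/\sigma^2 > 0, so for fixed \sigma
% maximizing \log p (and hence p) over x is the same as minimizing
% \frac{1}{2} \sum_{j=1}^m [\phi(x; t_j) - y_j]^2, which is exactly (10.7).
```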

The assumptions on $\epsilon_j$ in the previous paragraph are common, but they do not
describe the only situation for which the minimizer of the sum of squares makes good
statistical sense. Seber and Wild [280] describe many instances in which minimization of
functions like (10.7), or generalizations of this function such as

$$r(x)^T W r(x), \quad \text{where } W \in \mathbb{R}^{m \times m} \text{ is symmetric},$$

is the crucial step in obtaining estimates of the parameters $x$ from observed data.


10.2  LINEAR  LEAST-SQUARES  PROBLEMS 


Many models $\phi(x; t)$ in data-fitting problems are linear functions of $x$. In these cases, the
residuals $r_j(x)$ defined by (10.8) also are linear, and the problem of minimizing (10.7) is
called a linear least-squares problem. We can write the residual vector as $r(x) = Jx - y$ for
some matrix $J$ and vector $y$, both independent of $x$, so that the objective is

$$f(x) = \tfrac12 \|Jx - y\|^2, \tag{10.13}$$

where $y = -r(0)$. We also have

$$\nabla f(x) = J^T (Jx - y), \qquad \nabla^2 f(x) = J^T J.$$

(Note that the second term in $\nabla^2 f(x)$ (see (10.5)) disappears, because $\nabla^2 r_j = 0$ for all
$j = 1, 2, \dots, m$.) It is easy to see that the $f(x)$ in (10.13) is convex, a property that does
not necessarily hold for the nonlinear problem (10.1). Theorem 2.5 tells us that any point $x^*$
for which $\nabla f(x^*) = 0$ is the global minimizer of $f$. Therefore, $x^*$ must satisfy the following
linear system of equations:

$$J^T J x^* = J^T y. \tag{10.14}$$

These are known as the normal equations for (10.13).

We outline briefly three major algorithms for the unconstrained linear least-squares
problem. We assume in most of our discussion that $m \ge n$ and that $J$ has full column
rank.

The first and most obvious algorithm is simply to form and solve the system (10.14)
by the following three-step procedure:

•  compute the coefficient matrix $J^T J$ and the right-hand-side $J^T y$;

•  compute the Cholesky factorization of the symmetric matrix $J^T J$;

•  perform two triangular substitutions with the Cholesky factors to recover the solution
   $x^*$.

The Cholesky factorization

$$J^T J = R^T R \tag{10.15}$$

(where $R$ is an $n \times n$ upper triangular matrix with positive diagonal elements) is guaranteed
to exist when $m \ge n$ and $J$ has rank $n$. This method is frequently used in practice
and is often effective, but it has one significant disadvantage, namely, that the condition
number of $J^T J$ is the square of the condition number of $J$. Since the relative error
in the computed solution of a problem is usually proportional to the condition number,
the Cholesky-based method may result in less accurate solutions than those obtained
from methods that avoid this squaring of the condition number. When $J$ is ill conditioned,
the Cholesky factorization process may even break down, since roundoff errors
may cause small negative elements to appear on the diagonal during the factorization
process.
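
The three-step procedure above can be sketched in plain Python as follows. This is a minimal illustration for small dense problems, not a production routine; the function name `solve_normal_equations` is hypothetical, and the breakdown mentioned in the text shows up here as a nonpositive pivot, which the sketch reports by raising an error.

```python
import math

def solve_normal_equations(J, y):
    """Solve J^T J x = J^T y using the Cholesky factorization J^T J = R^T R."""
    m, n = len(J), len(J[0])
    # Step 1: form the coefficient matrix A = J^T J and right-hand side b = J^T y.
    A = [[sum(J[k][i] * J[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    b = [sum(J[k][i] * y[k] for k in range(m)) for i in range(n)]
    # Step 2: compute the upper triangular Cholesky factor R, row by row.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d = A[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if d <= 0.0:
            raise ValueError("J^T J is not numerically positive definite")
        R[i][i] = math.sqrt(d)
        for j in range(i + 1, n):
            R[i][j] = (A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    # Step 3a: forward substitution, solving R^T z = b.
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(R[k][i] * z[k] for k in range(i))) / R[i][i]
    # Step 3b: back substitution, solving R x = z.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x
```

For instance, with rows `[1.0, t]` for `t = 0, 1, 2, 3` and `y = 1 + 2t`, the routine recovers the line's intercept and slope.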

A second approach is based on a QR factorization of the matrix $J$. Since the Euclidean
norm of any vector is not affected by orthogonal transformations, we have

$$\|Jx - y\| = \|Q^T (Jx - y)\| \tag{10.16}$$

for any $m \times m$ orthogonal matrix $Q$. Suppose we perform a QR factorization with column
pivoting on the matrix $J$ (see (A.24)) to obtain

$$J \Pi = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \tag{10.17}$$

where

$\Pi$ is an $n \times n$ permutation matrix (hence, orthogonal);

$Q$ is $m \times m$ orthogonal;

$Q_1$ is the first $n$ columns of $Q$, while $Q_2$ contains the last $m - n$ columns;

$R$ is $n \times n$ upper triangular with positive diagonal elements.

By combining (10.16) and (10.17), we obtain

$$\|Jx - y\|_2^2 = \left\| \begin{bmatrix} Q_1^T \\ Q_2^T \end{bmatrix} (J \Pi \Pi^T x - y) \right\|_2^2 = \left\| \begin{bmatrix} R \\ 0 \end{bmatrix} (\Pi^T x) - \begin{bmatrix} Q_1^T y \\ Q_2^T y \end{bmatrix} \right\|_2^2 = \|R (\Pi^T x) - Q_1^T y\|_2^2 + \|Q_2^T y\|_2^2. \tag{10.18}$$

No choice of $x$ has any effect on the second term of this last expression, but we can minimize
$\|Jx - y\|$ by driving the first term to zero, that is, by setting

$$x^* = \Pi R^{-1} Q_1^T y.$$

(In practice, we perform a triangular substitution to solve $Rz = Q_1^T y$, then permute the
components of $z$ to obtain $x^* = \Pi z$.)
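
A minimal NumPy sketch of this QR-based solve is given below. It rests on two simplifying assumptions that differ from the text: `numpy.linalg.qr` performs no column pivoting (so effectively $\Pi = I$), and the diagonal of the computed $R$ may have either sign, which does not affect the solution. The function name `lstsq_qr` is hypothetical.

```python
import numpy as np

def lstsq_qr(J, y):
    """Minimize ||J x - y||_2 via a reduced QR factorization J = Q1 R (no pivoting)."""
    Q1, R = np.linalg.qr(J, mode="reduced")   # Q1 is m-by-n, R is n-by-n upper triangular
    c = Q1.T @ y                              # the vector Q1^T y
    n = R.shape[0]
    z = np.zeros(n)
    for i in range(n - 1, -1, -1):            # back substitution for R z = Q1^T y
        z[i] = (c[i] - R[i, i + 1:] @ z[i + 1:]) / R[i, i]
    return z                                  # equals x* because no permutation was applied
```

At the returned point the normal-equations residual $J^T (Jx - y)$ should vanish to rounding error, which gives a simple check on the computation. When $J$ is nearly rank-deficient, a pivoted factorization as described in the text would be the safer choice.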

This QR-based approach does not degrade the conditioning of the problem
unnecessarily. The relative error in the final computed solution $x^*$ is usually proportional to the
condition number of $J$, not its square, and this method is usually reliable. Some situations,
however, call for greater robustness or more information about the sensitivity of the
solution to perturbations in the data ($J$ or $y$). A third approach, based on the singular-value
decomposition (SVD) of $J$, can be used in these circumstances. Recall from (A.15) that the
SVD of $J$ is given by

$$J = U \begin{bmatrix} S \\ 0 \end{bmatrix} V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} S \\ 0 \end{bmatrix} V^T = U_1 S V^T, \tag{10.19}$$

where

$U$ is $m \times m$ orthogonal;

$U_1$ contains the first $n$ columns of $U$, $U_2$ the last $m - n$ columns;

$V$ is $n \times n$ orthogonal;

$S$ is $n \times n$ diagonal, with diagonal elements $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n > 0$.

(Note that $J^T J = V S^2 V^T$, so that the columns of $V$ are eigenvectors of $J^T J$ with
eigenvalues $\sigma_j^2$, $j = 1, 2, \dots, n$.) By following the same logic that led to (10.18), we obtain

$$\|Jx - y\|_2^2 = \left\| \begin{bmatrix} S \\ 0 \end{bmatrix} (V^T x) - \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} y \right\|_2^2 = \|S (V^T x) - U_1^T y\|_2^2 + \|U_2^T y\|_2^2. \tag{10.20}$$

Again, the optimum is found by choosing $x$ to make the first term equal to zero; that is,

$$x^* = V S^{-1} U_1^T y.$$

Denoting the $i$th columns of $U$ and $V$ by $u_i \in \mathbb{R}^m$ and $v_i \in \mathbb{R}^n$, respectively, we have

$$x^* = \sum_{i=1}^n \frac{u_i^T y}{\sigma_i} v_i. \tag{10.21}$$

This formula yields useful information about the sensitivity of $x^*$. When $\sigma_i$ is small, $x^*$
is particularly sensitive to perturbations in $y$ that affect $u_i^T y$, and also to perturbations in
$J$ that affect this same quantity. Such information is particularly useful when $J$ is nearly
rank-deficient, that is, when $\sigma_n / \sigma_1 \ll 1$. It is sometimes worth the extra cost of the SVD
algorithm to obtain this sensitivity information.

All three approaches above have their place. The Cholesky-based algorithm is
particularly useful when $m \gg n$ and it is practical to store $J^T J$ but not $J$ itself. It can also be less
expensive than the alternatives when $m \gg n$ and $J$ is sparse. However, this approach must
be modified when $J$ is rank-deficient or ill conditioned to allow pivoting of the diagonal
elements of $J^T J$. The QR approach avoids squaring of the condition number and hence
may be more numerically robust. While potentially the most expensive, the SVD approach
is the most robust and reliable of all. When $J$ is actually rank-deficient, some of the singular
values $\sigma_i$ are exactly zero, and any vector $x^*$ of the form

$$x^* = \sum_{\sigma_i \ne 0} \frac{u_i^T y}{\sigma_i} v_i + \sum_{\sigma_i = 0} \tau_i v_i \tag{10.22}$$

(for arbitrary coefficients $\tau_i$) is a minimizer of (10.20). Frequently, the solution with smallest
norm is the most desirable, and we obtain it by setting each $\tau_i = 0$ in (10.22). When $J$ has
full rank but is ill conditioned, the last few singular values $\sigma_n, \sigma_{n-1}, \dots$ are small relative
to $\sigma_1$. The coefficients $u_i^T y / \sigma_i$ in (10.22) are particularly sensitive to perturbations in $u_i^T y$
when $\sigma_i$ is small, so an approximate solution that is less sensitive to perturbations than the
true solution can be obtained by omitting these terms from the summation.
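
The summation in (10.22), with each $\tau_i = 0$ and with small singular values optionally omitted, can be sketched with NumPy as follows. The function name `lstsq_svd` and the relative threshold `rel_tol` are hypothetical; the text does not prescribe how "small" should be decided, so the threshold is an assumption left to the user.

```python
import numpy as np

def lstsq_svd(J, y, rel_tol=0.0):
    """Sum the terms (u_i^T y / sigma_i) v_i of (10.22), taking every tau_i = 0.

    Terms with sigma_i <= rel_tol * sigma_1 (or sigma_i == 0) are omitted;
    rel_tol is an assumed, user-chosen truncation threshold.
    """
    U1, s, Vt = np.linalg.svd(J, full_matrices=False)   # J = U1 @ diag(s) @ Vt
    x = np.zeros(J.shape[1])
    for u_i, sigma_i, v_i in zip(U1.T, s, Vt):           # rows of Vt are the v_i
        if sigma_i > 0.0 and sigma_i > rel_tol * s[0]:
            x += (u_i @ y) / sigma_i * v_i
    return x
```

With `rel_tol=0.0` and a full-rank $J$ this reproduces (10.21). With a small positive `rel_tol` it returns the truncated, less perturbation-sensitive approximation described above, and for an exactly rank-deficient $J$ it returns the minimum-norm minimizer.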

When the problem is very large, it may be efficient to use iterative techniques, such
as the conjugate gradient method, to solve the normal equations (10.14). A direct
implementation of conjugate gradients (Algorithm 5.2) requires one matrix-vector multiplication
with $J^T J$ to be performed at each iteration. This operation can be performed by means of
successive multiplications by $J$ and $J^T$; we need only the ability to perform matrix-vector
multiplications with these two matrices to implement this algorithm. Several modifications
of the conjugate gradient approach have been proposed that involve a similar amount of
work per iteration (one matrix-vector multiplication each with $J$ and $J^T$) but that have
superior numerical properties. Some alternatives are described by Paige and Saunders [234],
who propose in particular an algorithm called LSQR which has become the basis of a highly
successful code.
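
As an illustration of the "direct implementation" mentioned above, the following sketch applies the standard conjugate gradient recurrences to the normal equations while touching $J$ only through products with $J$ and $J^T$. It is a plain sketch, not LSQR, and it inherits the squared condition number of $J^T J$; the function name, the relative stopping tolerance `tol`, and the iteration cap are assumptions made for the example.

```python
import numpy as np

def cg_normal_equations(J, y, tol=1e-10, max_iter=None):
    """Conjugate gradients on J^T J x = J^T y, never forming J^T J explicitly."""
    n = J.shape[1]
    max_iter = 10 * n if max_iter is None else max_iter
    x = np.zeros(n)
    r = -(J.T @ y)                 # residual J^T J x - J^T y at the starting point x = 0
    p = -r
    rr = r @ r
    b_norm = np.sqrt(rr)
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * b_norm:
            break
        Jp = J @ p                 # first multiply by J ...
        Ap = J.T @ Jp              # ... then by J^T
        alpha = rr / (Jp @ Jp)     # uses p^T J^T J p = ||J p||^2
        x = x + alpha * p
        r = r + alpha * Ap
        rr_new = r @ r
        p = -r + (rr_new / rr) * p
        rr = rr_new
    return x
```

Because only `J @ p` and `J.T @ Jp` appear, `J` could equally be a sparse matrix or any object supporting these two products.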


10.3 ALGORITHMS FOR NONLINEAR LEAST-SQUARES PROBLEMS

THE  GAUSS-NEWTON  METHOD 

We now describe methods for minimizing the nonlinear objective function (10.1) that
exploit the structure in the gradient $\nabla f$ (10.4) and Hessian $\nabla^2 f$ (10.5). The simplest of these
methods, the Gauss-Newton method, can be viewed as a modified Newton's method with
line search. Instead of solving the standard Newton equations $\nabla^2 f(x_k) p = -\nabla f(x_k)$, we
solve the following system to obtain the search direction $p_k^{GN}$:

$$J_k^T J_k p_k^{GN} = -J_k^T r_k. \tag{10.23}$$

This simple modification gives a number of advantages over the plain Newton's method.
First, our use of the approximation

$$\nabla^2 f_k \approx J_k^T J_k \tag{10.24}$$

saves us the trouble of computing the individual residual Hessians $\nabla^2 r_j$, $j = 1, 2, \dots, m$,
which are needed in the second term in (10.5). In fact, if we calculated the Jacobian $J_k$ in the
course of evaluating the gradient $\nabla f_k = J_k^T r_k$, the approximation (10.24) does not require
any additional derivative evaluations, and the savings in computational time can be quite
significant in some applications. Second, there are many interesting situations in which the
first term $J^T J$ in (10.5) dominates the second term (at least close to the solution $x^*$), so
that $J_k^T J_k$ is a close approximation to $\nabla^2 f_k$ and the convergence rate of Gauss-Newton is
similar to that of Newton's method. The first term in (10.5) will be dominant when the
norm of each second-order term (that is, $|r_j(x)| \, \|\nabla^2 r_j(x)\|$) is significantly smaller than the
eigenvalues of $J^T J$. As mentioned in the introduction, we tend to see this behavior when
either the residuals $r_j$ are small or when they are nearly affine (so that the $\|\nabla^2 r_j\|$ are small).
In practice, many least-squares problems have small residuals at the solution, leading to
rapid local convergence of Gauss-Newton.

A  third  advantage  of  Gauss-Newton  is  that  whenever  Jk  has  full  rank  and  the  gradient 
V/*  is  nonzero,  the  direction  p™  is  a  descent  direction  for  /,  and  therefore  a  suitable 
direction  for  a  line  search.  From  (10.4)  and  ( 10.23)  we  have 


(ptN)TVfk  =  ipT)TJlrk  =  -(p™)T Jl  Jkpf  =  -\\Jkpf*\\2  <  0. 


(10.25) 
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The  final  inequality  is  strict  unless  Jk  pf  —  0,  in  which  case  we  have  by  ( 10.23)  and  full  rank 
of  Jk  that  jf  rk  —  V  fk  —  0;  that  is,  Xk  is  a  stationary  point.  Finally,  the  fourth  advantage 
of  Gauss-Newton  arises  from  the  similarity  between  the  equations  (10.23)  and  the  normal 
equations  (10.14)  for  the  linear  least-squares  problem.  This  connection  tells  us  that  pf*  is 
in  fact  the  solution  of  the  linear  least-squares  problem 

min  \\\Jkp  +  rk\\2.  (10.26) 

p 

Hence, we can find the search direction by applying linear least-squares algorithms to the
subproblem (10.26). In fact, if the QR or SVD-based algorithms are used, there is no need
to calculate the Hessian approximation $J_k^T J_k$ in (10.23) explicitly; we can work directly with
the Jacobian $J_k$. The same is true if we use a conjugate-gradient technique to solve (10.26).
For this method we need to perform matrix-vector multiplications with $J_k^T J_k$, which can
be done by first multiplying by $J_k$ and then by $J_k^T$.
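
The matrix-free idea can be sketched in a few lines. The following is a minimal illustration, not a production solver: it applies plain conjugate gradients to the normal equations (10.23), computing each product with $J^T J$ as a product with $J$ followed by a product with $J^T$. The function names, the dense list-of-rows storage, and the stopping tolerance are assumptions made for this example.

```python
def matvec(A, v):
    """Return A v for a dense matrix A stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rmatvec(A, w):
    """Return A^T w without forming A^T."""
    n = len(A[0])
    out = [0.0] * n
    for row, wi in zip(A, w):
        for j in range(n):
            out[j] += row[j] * wi
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gauss_newton_step_cg(J, r, tol=1e-12, max_iter=None):
    """Approximately solve J^T J p = -J^T r by conjugate gradients.

    J^T J is never formed: each product (J^T J) d is computed as J^T (J d).
    Assumes J has full column rank, so that ||J d|| > 0 for d != 0.
    """
    n = len(J[0])
    max_iter = max_iter or 2 * n
    p = [0.0] * n
    g = [-x for x in rmatvec(J, r)]      # residual of the normal equations at p = 0
    d = g[:]
    gg = dot(g, g)
    for _ in range(max_iter):
        if gg <= tol:
            break
        Jd = matvec(J, d)
        alpha = gg / dot(Jd, Jd)         # d^T J^T J d equals ||J d||^2
        p = [pi + alpha * di for pi, di in zip(p, d)]
        g = [gi - alpha * qi for gi, qi in zip(g, rmatvec(J, Jd))]
        gg_new = dot(g, g)
        d = [gi + (gg_new / gg) * di for gi, di in zip(g, d)]
        gg = gg_new
    return p
```

For sparse $J$, the two helper products would be replaced by sparse matrix-vector routines; the loop itself is unchanged.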

If the number of residuals $m$ is large while the number of variables $n$ is relatively
small, it may be unwise to store the Jacobian $J$ explicitly. A preferable strategy may be to
calculate the matrix $J^T J$ and gradient vector $J^T r$ by evaluating $r_j$ and $\nabla r_j$ successively for
$j = 1, 2, \dots, m$ and performing the accumulations

$$ J^T J = \sum_{j=1}^{m} (\nabla r_j)(\nabla r_j)^T, \qquad J^T r = \sum_{j=1}^{m} r_j (\nabla r_j). \tag{10.27} $$

The  Gauss-Newton  steps  can  then  be  computed  by  solving  the  system  (10.23)  of  normal 
equations  directly. 
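
A minimal sketch of the accumulation (10.27) follows. The callback `residual_and_gradient` is a hypothetical interface chosen for this illustration; it returns one residual and its gradient at a time, so that only $O(n^2)$ storage is needed.

```python
def accumulate_normal_equations(residual_and_gradient, m, n):
    """Build J^T J (n x n) and J^T r (length n) one residual at a time.

    residual_and_gradient(j) must return (r_j, grad_r_j) for j = 0, ..., m-1.
    The m x n Jacobian is never stored.
    """
    JTJ = [[0.0] * n for _ in range(n)]
    JTr = [0.0] * n
    for j in range(m):
        rj, gj = residual_and_gradient(j)
        for a in range(n):
            JTr[a] += rj * gj[a]            # contribution r_j * grad r_j
            for b in range(n):
                JTJ[a][b] += gj[a] * gj[b]  # contribution (grad r_j)(grad r_j)^T
    return JTJ, JTr
```

Note that forming $J^T J$ squares the condition number of the problem, so this strategy trades some numerical robustness for storage.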

The subproblem (10.26) suggests another motivation for the Gauss-Newton search
direction. We can view this equation as being obtained from a linear model for the vector
function $r(x_k + p) \approx r_k + J_k p$, substituted into the function $\tfrac{1}{2}\|\cdot\|^2$. In other words, we
use the approximation

$$ f(x_k + p) = \tfrac{1}{2} \|r(x_k + p)\|^2 \approx \tfrac{1}{2} \|J_k p + r_k\|^2, $$

and choose $p_k^{\mathrm{GN}}$ to be the minimizer of this approximation.

Implementations of the Gauss-Newton method usually perform a line search in the
direction $p_k^{\mathrm{GN}}$, requiring the step length $\alpha_k$ to satisfy conditions like those discussed in
Chapter 3, such as the Armijo and Wolfe conditions; see (3.4) and (3.6).
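
As a rough illustration of such an implementation, the sketch below runs Gauss-Newton with a simple backtracking line search on a two-parameter exponential model. The model $x_0 e^{x_1 t}$, the data, and all function names are assumptions made for this example, and the line search enforces only the sufficient-decrease (Armijo) condition rather than the full Wolfe conditions.

```python
import math


def residuals(x, t, y):
    """r_j(x) = x0 * exp(x1 * t_j) - y_j (illustrative model, not from the text)."""
    return [x[0] * math.exp(x[1] * tj) - yj for tj, yj in zip(t, y)]


def jacobian(x, t):
    """Rows are the gradients of r_j with respect to (x0, x1)."""
    return [[math.exp(x[1] * tj), x[0] * tj * math.exp(x[1] * tj)] for tj in t]


def objective(x, t, y):
    return 0.5 * sum(ri * ri for ri in residuals(x, t, y))


def gauss_newton(x, t, y, c1=1e-4, tol=1e-10, max_iter=50):
    x = list(x)
    for _ in range(max_iter):
        r, J = residuals(x, t, y), jacobian(x, t)
        g = [sum(row[a] * ri for row, ri in zip(J, r)) for a in range(2)]   # J^T r
        if math.hypot(g[0], g[1]) < tol:
            break
        A = [[sum(row[a] * row[b] for row in J) for b in range(2)]
             for a in range(2)]                                             # J^T J
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        # Solve the 2 x 2 normal equations (10.23) by Cramer's rule.
        p = [(-g[0] * A[1][1] + g[1] * A[0][1]) / det,
             (-g[1] * A[0][0] + g[0] * A[1][0]) / det]
        slope = g[0] * p[0] + g[1] * p[1]       # negative, by (10.25)
        f0, alpha = objective(x, t, y), 1.0
        while alpha > 1e-12:                    # backtrack until sufficient decrease
            trial = [x[0] + alpha * p[0], x[1] + alpha * p[1]]
            if objective(trial, t, y) <= f0 + c1 * alpha * slope:
                break
            alpha *= 0.5
        x = [x[0] + alpha * p[0], x[1] + alpha * p[1]]
    return x
```

On zero-residual data generated from the model itself, the unit step is accepted near the solution and the iterates converge rapidly.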

CONVERGENCE  OF  THE  GAUSS-NEWTON  METHOD 

The theory of Chapter 3 can be applied to study the convergence properties of the
Gauss-Newton method. We prove a global convergence result under the assumption that
the Jacobians $J(x)$ have their singular values uniformly bounded away from zero in the
region of interest; that is, there is a constant $\gamma > 0$ such that

$$ \|J(x) z\| \ge \gamma \|z\| \tag{10.28} $$

for all $x$ in a neighborhood $\mathcal{N}$ of the level set

$$ \mathcal{L} = \{ x \mid f(x) \le f(x_0) \}, \tag{10.29} $$

where $x_0$ is the starting point for the algorithm. We assume here and in the rest of the chapter
that $\mathcal{L}$ is bounded. Our result is a consequence of Theorem 3.2.

Theorem  10.1. 

Suppose each residual function $r_j$ is Lipschitz continuously differentiable in a
neighborhood $\mathcal{N}$ of the bounded level set (10.29), and that the Jacobians $J(x)$ satisfy the uniform
full-rank condition (10.28) on $\mathcal{N}$. Then if the iterates $x_k$ are generated by the Gauss-Newton
method with step lengths $\alpha_k$ that satisfy (3.6), we have

$$ \lim_{k \to \infty} J_k^T r_k = 0. $$

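PROOF. First, we note that the neighborhood $\mathcal{N}$ of the bounded level set $\mathcal{L}$ can be chosen
small enough that the following properties are satisfied for some positive constants $L$ and $\beta$:

$$ |r_j(x)| \le \beta \quad \text{and} \quad \|\nabla r_j(x)\| \le \beta, $$

$$ |r_j(x) - r_j(\tilde{x})| \le L \|x - \tilde{x}\| \quad \text{and} \quad \|\nabla r_j(x) - \nabla r_j(\tilde{x})\| \le L \|x - \tilde{x}\|, $$

for all $x, \tilde{x} \in \mathcal{N}$ and all $j = 1, 2, \dots, m$. It is easy to deduce that there exists a constant
$\bar{\beta} > 0$ such that $\|J(x)^T\| = \|J(x)\| \le \bar{\beta}$ for all $x \in \mathcal{L}$. In addition, by applying the
results concerning Lipschitz continuity of products and sums (see for example (A.43)) to
the gradient $\nabla f(x) = \sum_{j=1}^{m} r_j(x) \nabla r_j(x)$, we can show that $\nabla f$ is Lipschitz continuous.
Hence, the assumptions of Theorem 3.2 are satisfied.

We check next that the angle $\theta_k$ between the search direction $p_k^{\mathrm{GN}}$ and the negative
gradient $-\nabla f_k$ is uniformly bounded away from $\pi/2$. From (3.12), (10.25), and (10.28),
we have for $x = x_k \in \mathcal{L}$ and $p^{\mathrm{GN}} = p_k^{\mathrm{GN}}$ that

$$ \cos \theta_k = -\frac{(\nabla f)^T p^{\mathrm{GN}}}{\|p^{\mathrm{GN}}\| \, \|\nabla f\|} = \frac{\|J p^{\mathrm{GN}}\|^2}{\|p^{\mathrm{GN}}\| \, \|J^T J p^{\mathrm{GN}}\|} \ge \frac{\gamma^2 \|p^{\mathrm{GN}}\|^2}{\bar{\beta}^2 \|p^{\mathrm{GN}}\|^2} = \frac{\gamma^2}{\bar{\beta}^2} > 0. $$

It follows from (3.14) in Theorem 3.2 that $\nabla f(x_k) \to 0$, giving the result. □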

If $J_k$ is rank-deficient for some $k$ (so that a condition like (10.28) is not satisfied), the
coefficient matrix in (10.23) is singular. The system (10.23) still has a solution, however,
because of the equivalence between this linear system and the minimization problem (10.26).
In fact, there are infinitely many solutions for $p_k^{\mathrm{GN}}$ in this case; each of them has the form
of (10.22). However, there is no longer an assurance that $\cos \theta_k$ is uniformly bounded away
from zero, so we cannot prove a result like Theorem 10.1.

The convergence of Gauss-Newton to a solution $x^*$ can be rapid if the leading term
$J_k^T J_k$ dominates the second-order term in the Hessian (10.5). Suppose that $x_k$ is close to
$x^*$ and that assumption (10.28) is satisfied. Then, applying an argument like the Newton’s
method analysis (3.31), (3.32), (3.33) in Chapter 3, we have for a unit step in the
Gauss-Newton direction that

$$ \begin{aligned} x_k + p_k^{\mathrm{GN}} - x^* &= x_k - x^* - [J^T J(x_k)]^{-1} \nabla f(x_k) \\ &= [J^T J(x_k)]^{-1} \left[ J^T J(x_k)(x_k - x^*) + \nabla f(x^*) - \nabla f(x_k) \right], \end{aligned} $$


where $J^T J(x)$ is shorthand notation for $J(x)^T J(x)$. Using $H(x)$ to denote the second-order
term in (10.5), we have from (A.57) that

$$ \nabla f(x_k) - \nabla f(x^*) = \int_0^1 J^T J(x^* + t(x_k - x^*))(x_k - x^*) \, dt + \int_0^1 H(x^* + t(x_k - x^*))(x_k - x^*) \, dt. $$


A similar argument as in (3.32), (3.33), assuming Lipschitz continuity of $H(\cdot)$ near $x^*$,
shows that

$$ \begin{aligned} \|x_k + p_k^{\mathrm{GN}} - x^*\| &\le \int_0^1 \left\| [J^T J(x_k)]^{-1} H(x^* + t(x_k - x^*)) \right\| \, \|x_k - x^*\| \, dt + O(\|x_k - x^*\|^2) \\ &\approx \left\| [J^T J(x^*)]^{-1} H(x^*) \right\| \, \|x_k - x^*\| + O(\|x_k - x^*\|^2). \end{aligned} \tag{10.30} $$

Hence, if $\|[J^T J(x^*)]^{-1} H(x^*)\| \ll 1$, we can expect a unit step of Gauss-Newton to move
us much closer to the solution $x^*$, giving rapid local convergence. When $H(x^*) = 0$, the
convergence is actually quadratic.
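
The estimate (10.30) can be observed numerically on a small example. The sketch below uses a one-variable test problem chosen for this illustration (it does not appear in the text): $r_1(x) = x + 1$ and $r_2(x) = \lambda x^2 + x - 1$, for which $x^* = 0$ is a stationary point with $J^T J(x^*) = 2$ and $H(x^*) = -2\lambda$. The predicted linear rate is therefore $|\lambda|$; here `lam` denotes the model parameter $\lambda$ of this toy problem.

```python
def gn_ratio(lam=0.5, x=0.1, iters=20):
    """Run unit-step Gauss-Newton on r1 = x + 1, r2 = lam*x^2 + x - 1 (n = 1).

    Returns the final iterate and the last error ratio |x_{k+1}| / |x_k|,
    where the solution is x* = 0.
    """
    ratio = None
    for _ in range(iters):
        r1, r2 = x + 1.0, lam * x * x + x - 1.0
        j1, j2 = 1.0, 2.0 * lam * x + 1.0   # the two Jacobian entries
        grad = j1 * r1 + j2 * r2            # J^T r
        jtj = j1 * j1 + j2 * j2             # J^T J
        x_new = x - grad / jtj              # unit Gauss-Newton step
        ratio = abs(x_new) / abs(x)
        x = x_new
    return x, ratio
```

With `lam = 0.5` the observed ratio settles near 0.5, matching $\|[J^T J(x^*)]^{-1} H(x^*)\| = |\lambda|$: the convergence is linear, not quadratic, because the residual $r_2(x^*) = -1$ is nonzero.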

When $n$ and $m$ are both large and the Jacobian $J(x)$ is sparse, the cost of computing
steps exactly by factoring either $J_k$ or $J_k^T J_k$ at each iteration may become quite expensive
relative to the cost of function and gradient evaluations. In this case, we can design inexact
variants of the Gauss-Newton algorithm that are analogous to the inexact Newton
algorithms discussed in Chapter 7. We simply replace the Hessian $\nabla^2 f(x_k)$ in these methods by
its approximation $J_k^T J_k$. The positive semidefiniteness of this approximation simplifies the
resulting algorithms in several places.

THE LEVENBERG-MARQUARDT METHOD

Recall  that  the  Gauss-Newton  method  is  like  Newton’s  method  with  line  search,  except 
that  we  use  the  convenient  and  often  effective  approximation  (10.24)  for  the  Hessian.  The 
Levenberg-Marquardt  method  can  be  obtained  by  using  the  same  Hessian  approximation, 
but  replacing  the  line  search  with  a  trust-region  strategy.  The  use  of  a  trust  region  avoids 
one  of  the  weaknesses  of  Gauss-Newton,  namely,  its  behavior  when  the  Jacobian  J(x)  is 
rank-deficient,  or  nearly  so.  Since  the  same  Hessian  approximations  are  used  in  each  case, 
the  local  convergence  properties  of  the  two  methods  are  similar. 

The Levenberg-Marquardt method can be described and analyzed using the
trust-region framework of Chapter 4. (In fact, the Levenberg-Marquardt method is sometimes
considered to be the progenitor of the trust-region approach for general unconstrained
optimization discussed in Chapter 4.) For a spherical trust region, the subproblem to be
solved at each iteration is

$$ \min_p \tfrac{1}{2} \|J_k p + r_k\|^2, \quad \text{subject to } \|p\| \le \Delta_k, \tag{10.31} $$

where $\Delta_k > 0$ is the trust-region radius. In effect, we are choosing the model function $m_k(\cdot)$
in (4.3) to be

$$ m_k(p) = \tfrac{1}{2} \|r_k\|^2 + p^T J_k^T r_k + \tfrac{1}{2} p^T J_k^T J_k p. \tag{10.32} $$

We drop the iteration counter $k$ during the rest of this section and concern ourselves
with the subproblem (10.31). The results of Chapter 4 allow us to characterize the solution
of (10.31) in the following way: When the solution $p^{\mathrm{GN}}$ of the Gauss-Newton equations
(10.23) lies strictly inside the trust region (that is, $\|p^{\mathrm{GN}}\| < \Delta$), then this step $p^{\mathrm{GN}}$ also solves
the subproblem (10.31). Otherwise, there is a $\lambda > 0$ such that the solution $p = p^{\mathrm{LM}}$ of
(10.31) satisfies $\|p\| = \Delta$ and

$$ (J^T J + \lambda I) p = -J^T r. \tag{10.33} $$

This claim is verified in the following lemma, which is a straightforward consequence of
Theorem 4.1 from Chapter 4.

Lemma  10.2. 

The vector $p^{\mathrm{LM}}$ is a solution of the trust-region subproblem

$$ \min_p \tfrac{1}{2} \|J p + r\|^2, \quad \text{subject to } \|p\| \le \Delta, $$

if and only if $p^{\mathrm{LM}}$ is feasible and there is a scalar $\lambda \ge 0$ such that

$$ (J^T J + \lambda I) p^{\mathrm{LM}} = -J^T r, \tag{10.34a} $$

$$ \lambda (\Delta - \|p^{\mathrm{LM}}\|) = 0. \tag{10.34b} $$

PROOF. In Theorem 4.1, the semidefiniteness condition (4.8c) is satisfied automatically,
since $J^T J$ is positive semidefinite and $\lambda \ge 0$. The two conditions (10.34a) and (10.34b)
follow from (4.8a) and (4.8b), respectively. □
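
The characterization in Lemma 10.2 can be illustrated with a small sketch for $n = 2$. It is a simplification made for this example only: it assumes $J^T J$ is nonsingular, and it finds $\lambda$ by plain bisection on $\|p(\lambda)\| = \Delta$ rather than by the Chapter 4 rootfinding iteration. Bisection works here because $\|p(\lambda)\|$ decreases monotonically as $\lambda$ increases. The function names are hypothetical.

```python
import math


def solve2(A, b):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def lm_step(JTJ, JTr, delta, iters=200):
    """Return (p, lam) satisfying (10.34a) and (10.34b) for a nonsingular 2 x 2 J^T J."""
    def step(lam):
        A = [[JTJ[0][0] + lam, JTJ[0][1]], [JTJ[1][0], JTJ[1][1] + lam]]
        return solve2(A, [-JTr[0], -JTr[1]])

    p = step(0.0)
    if math.hypot(p[0], p[1]) <= delta:
        return p, 0.0                    # Gauss-Newton step lies inside the region
    lo, hi = 0.0, 1.0
    while math.hypot(*step(hi)) > delta: # grow hi until the step is short enough
        hi *= 2.0
    for _ in range(iters):               # bisection on ||p(lam)|| = delta
        mid = 0.5 * (lo + hi)
        if math.hypot(*step(mid)) > delta:
            lo = mid
        else:
            hi = mid
    return step(hi), hi
```

When the returned `lam` is zero the trust region is inactive; otherwise the step lies on the boundary, in agreement with the complementarity condition (10.34b).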

Note that the equations (10.33) are just the normal equations for the following linear
least-squares problem:

$$ \min_p \frac{1}{2} \left\| \begin{bmatrix} J \\ \sqrt{\lambda} I \end{bmatrix} p + \begin{bmatrix} r \\ 0 \end{bmatrix} \right\|^2. \tag{10.35} $$

Just as in the Gauss-Newton case, the equivalence between (10.33) and (10.35) gives us a
way of solving the subproblem without computing the matrix-matrix product $J^T J$ and its
Cholesky factorization.

IMPLEMENTATION OF THE LEVENBERG-MARQUARDT METHOD

To find a value of $\lambda$ that approximately matches the given $\Delta$ in Lemma 10.2, we can
use the rootfinding algorithm described in Chapter 4. It is easy to safeguard this procedure:
The Cholesky factor $R$ is guaranteed to exist whenever the current estimate $\lambda^{(\ell)}$ is positive,
since the approximate Hessian $B = J^T J$ is already positive semidefinite. Because of the
special structure of $B$, we do not need to compute the Cholesky factorization of $B + \lambda I$
from scratch in each iteration of Algorithm 4.1. Rather, we present an efficient technique
for finding the following QR factorization of the coefficient matrix in (10.35):

$$ \begin{bmatrix} R_\lambda \\ 0 \end{bmatrix} = Q_\lambda^T \begin{bmatrix} J \\ \sqrt{\lambda} I \end{bmatrix} \tag{10.36} $$

($Q_\lambda$ orthogonal, $R_\lambda$ upper triangular). The upper triangular factor $R_\lambda$ satisfies
$R_\lambda^T R_\lambda = (J^T J + \lambda I)$.

We can save computer time in the calculation of the factorization (10.36) by using
a combination of Householder and Givens transformations. Suppose we use Householder
transformations to calculate the QR factorization of $J$ alone as

$$ J = Q \begin{bmatrix} R \\ 0 \end{bmatrix}. \tag{10.37} $$

We then have

$$ \begin{bmatrix} R \\ 0 \\ \sqrt{\lambda} I \end{bmatrix} = \begin{bmatrix} Q^T & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} J \\ \sqrt{\lambda} I \end{bmatrix}. \tag{10.38} $$

The leftmost matrix in this formula is upper triangular except for the $n$ nonzero terms of
the matrix $\sqrt{\lambda} I$. These can be eliminated by a sequence of $n(n+1)/2$ Givens rotations, in
which the diagonal elements of the upper triangular part are used to eliminate the nonzeros
of $\sqrt{\lambda} I$ and the fill-in terms that arise in the process. The first few steps of this process are as
follows:

rotate row $n$ of $R$ with row $n$ of $\sqrt{\lambda} I$, to eliminate the $(n, n)$ element of $\sqrt{\lambda} I$;

rotate row $n - 1$ of $R$ with row $n - 1$ of $\sqrt{\lambda} I$ to eliminate the $(n - 1, n - 1)$ element
of the latter matrix. This step introduces fill-in in position $(n - 1, n)$ of $\sqrt{\lambda} I$, which
is eliminated by rotating row $n$ of $R$ with row $n - 1$ of $\sqrt{\lambda} I$, to eliminate the fill-in
element at position $(n - 1, n)$;

rotate row $n - 2$ of $R$ with row $n - 2$ of $\sqrt{\lambda} I$, to eliminate the $(n - 2)$ diagonal in the
latter matrix. This step introduces fill-in in the $(n - 2, n - 1)$ and $(n - 2, n)$ positions,
which we eliminate by $\cdots$

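and so on. If we gather all the Givens rotations into a matrix $\bar{Q}_\lambda$, we obtain from (10.38)
that

$$ \bar{Q}_\lambda^T \begin{bmatrix} R \\ 0 \\ \sqrt{\lambda} I \end{bmatrix} = \begin{bmatrix} R_\lambda \\ 0 \\ 0 \end{bmatrix}, $$

and hence (10.36) holds with

$$ Q_\lambda = \begin{bmatrix} Q & 0 \\ 0 & I \end{bmatrix} \bar{Q}_\lambda. $$

The advantage of this combined approach is that when the value of $\lambda$ is changed in the
rootfinding algorithm, we need only recalculate $\bar{Q}_\lambda$ and not the Householder part of the
factorization (10.38). This feature can save a lot of computation in the case of $m \gg n$, since
just $O(n^3)$ operations are required to recalculate $\bar{Q}_\lambda$ and $R_\lambda$ for each value of $\lambda$, after the
initial cost of $O(mn^2)$ operations needed to calculate $Q$ in (10.37).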
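
The Givens sweep can be sketched as follows, under the assumption that the $n \times n$ triangular factor $R$ of $J$ is already available. The function name and the dense list-of-rows storage are choices made for this illustration. The rows of $\sqrt{\lambda} I$ are processed from the last to the first, as in the steps listed above, and the rotations themselves are not stored.

```python
import math


def givens_update(R, lam):
    """Return (R_lam, rotations) with R_lam upper triangular and
    R_lam^T R_lam = R^T R + lam * I.

    R is an n x n upper triangular matrix stored as a list of rows.
    """
    n = len(R)
    R = [row[:] for row in R]                # work on a copy
    s_lam = math.sqrt(lam)
    rotations = 0
    for i in range(n - 1, -1, -1):           # rows of sqrt(lam) I, last row first
        z = [0.0] * n
        z[i] = s_lam                         # row i of sqrt(lam) I
        for k in range(i, n):                # zero z[k] against row k of R
            a, b = R[k][k], z[k]
            h = math.hypot(a, b)
            rotations += 1
            if h == 0.0:
                continue
            c, s = a / h, b / h
            for j in range(k, n):            # apply the rotation to both rows
                rkj, zj = R[k][j], z[j]
                R[k][j] = c * rkj + s * zj
                z[j] = -s * rkj + c * zj     # z[k] becomes zero; fill-in moves right
    return R, rotations
```

The rotation count is $n(n+1)/2$, and each call costs $O(n^3)$ operations, independent of $m$.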

Least-squares problems are often poorly scaled. Some of the variables could have
values of about $10^4$, while other variables could be of order $10^{-6}$. If such wide variations are
ignored, the algorithms above may encounter numerical difficulties or produce solutions of
poor quality. One way to reduce the effects of poor scaling is to use an ellipsoidal trust region
in place of the spherical trust region defined above. The step is confined to an ellipse in
which the lengths of the principal axes are related to the typical values of the corresponding
variables. Analytically, the trust-region subproblem becomes

$$ \min_p \tfrac{1}{2} \|J_k p + r_k\|^2, \quad \text{subject to } \|D_k p\| \le \Delta_k, \tag{10.39} $$

where $D_k$ is a diagonal matrix with positive diagonal entries (cf. (7.13)). Instead of (10.33),
the solution of (10.39) satisfies an equation of the form

$$ (J_k^T J_k + \lambda D_k^2) p_k^{\mathrm{LM}} = -J_k^T r_k, \tag{10.40} $$

and, equivalently, solves the linear least-squares problem

$$ \min_p \frac{1}{2} \left\| \begin{bmatrix} J_k \\ \sqrt{\lambda} D_k \end{bmatrix} p + \begin{bmatrix} r_k \\ 0 \end{bmatrix} \right\|^2. \tag{10.41} $$

The diagonals of the scaling matrix $D_k$ can change from iteration to iteration, as we gather
information about the typical range of values for each component of $x$. If the variation in
these elements is kept within certain bounds, then the convergence theory for the spherical
case continues to hold, with minor modifications. Moreover, the technique described above
for calculating $R_\lambda$ needs no modification. Seber and Wild [280] suggest choosing the
diagonals of $D_k^2$ to match those of $J_k^T J_k$, to make the algorithm invariant under diagonal
scaling of the components of $x$. This approach is analogous to the technique of scaling
by diagonal elements of the Hessian, which was described in Section 4.5 in the context of
trust-region algorithms for unconstrained optimization.

For problems in which $m$ and $n$ are large and $J(x)$ is sparse, we may prefer to solve
(10.31) or (10.39) approximately using the CG-Steihaug algorithm, Algorithm 7.2 from
Chapter 7, with $J_k^T J_k$ replacing the exact Hessian $\nabla^2 f_k$. Positive semidefiniteness of the
matrix $J_k^T J_k$ makes for some simplification of this algorithm, because negative curvature
cannot arise. It is not necessary to calculate $J_k^T J_k$ explicitly to implement Algorithm 7.2; the
matrix-vector products required by the algorithm can be found by forming matrix-vector
products with $J_k$ and $J_k^T$ separately.

CONVERGENCE OF THE LEVENBERG-MARQUARDT METHOD

It  is  not  necessary  to  solve  the  trust-region  problem  (10.31)  exactly  in  order  for 
the  Levenberg-Marquardt  method  to  enjoy  global  convergence  properties.  The  following 
convergence  result  can  be  obtained  as  a  direct  consequence  of  Theorem  4.6. 

Theorem  10.3. 

Let $\eta \in (0, \tfrac{1}{4})$ in Algorithm 4.1 of Chapter 4, and suppose that the level set $\mathcal{L}$ defined
in (10.29) is bounded and that the residual functions $r_j(\cdot)$, $j = 1, 2, \dots, m$ are Lipschitz
continuously differentiable in a neighborhood $\mathcal{N}$ of $\mathcal{L}$. Assume that for each $k$, the approximate
solution $p_k$ of (10.31) satisfies the inequality

$$ m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right), \tag{10.42} $$

for some constant $c_1 > 0$, and in addition $\|p_k\| \le \gamma \Delta_k$ for some constant $\gamma \ge 1$. We then
have that

$$ \lim_{k \to \infty} \nabla f_k = \lim_{k \to \infty} J_k^T r_k = 0. $$

PROOF. The smoothness assumption on $r_j(\cdot)$ implies that we can choose a constant $M > 0$
such that $\|J_k^T J_k\| \le M$ for all iterates $k$. Note too that the objective $f$ is bounded below
(by zero). Hence, the assumptions of Theorem 4.6 are satisfied, and the result follows
immediately. □

As in Chapter 4, there is no need to calculate the right-hand-side in the inequality
(10.42) or to check it explicitly. Instead, we can simply require the decrease given by our
approximate solution $p_k$ of (10.31) to at least match the decrease given by the Cauchy point,
which can be calculated inexpensively in the same way as in Chapter 4. If we use the iterative
CG-Steihaug approach, Algorithm 7.2, the condition (10.42) is satisfied automatically for
$c_1 = 1/2$, since the Cauchy point is the first estimate of $p_k$ computed by this approach,
while subsequent estimates give smaller values for the model function.
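
For reference, the Cauchy point for the model (10.32) can be computed with two matrix-vector products. The sketch below uses the standard Chapter 4 formula with $B = J^T J$ and $g = J^T r$, namely $p^{C} = -\tau (\Delta / \|g\|) g$ with $\tau = \min(1, \|g\|^3 / (\Delta \, g^T B g))$ when $g^T B g > 0$. The function name is hypothetical, and the code assumes $g \ne 0$.

```python
import math


def cauchy_point(J, r, delta):
    """Return the Cauchy point p and the model decrease m(0) - m(p)
    for m(p) = 0.5 * ||J p + r||^2 with trust-region radius delta.

    Assumes the gradient g = J^T r is nonzero.
    """
    n = len(J[0])
    g = [sum(row[a] * ri for row, ri in zip(J, r)) for a in range(n)]   # g = J^T r
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    Jg = [sum(a * b for a, b in zip(row, g)) for row in J]
    curv = sum(v * v for v in Jg)            # g^T (J^T J) g = ||J g||^2 >= 0
    tau = 1.0 if curv == 0.0 else min(1.0, gnorm ** 3 / (delta * curv))
    p = [-tau * delta / gnorm * gi for gi in g]
    Jp = [sum(a * b for a, b in zip(row, p)) for row in J]
    decrease = -(sum(gi * pi for gi, pi in zip(g, p)) + 0.5 * sum(v * v for v in Jp))
    return p, decrease
```

Any approximate solution of (10.31) whose model decrease is at least the value returned here satisfies the requirement described above.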

The local convergence behavior of Levenberg-Marquardt is similar to that of the
Gauss-Newton method. Near a solution $x^*$ at which the first term of the Hessian $\nabla^2 f(x^*)$ in (10.5)
dominates the second term, the model function in (10.31) is a good approximation of $f$, the
trust region becomes inactive, and the algorithm takes Gauss-Newton steps, giving the rapid
local convergence expression (10.30).

METHODS  FOR  LARGE-RESIDUAL  PROBLEMS 

In large-residual problems, the quadratic model in (10.31) is an inadequate
representation of the function $f$ because the second-order part of the Hessian $\nabla^2 f(x)$ is too
significant to be ignored. In data-fitting problems, the presence of large residuals may
indicate that the model is inadequate or that errors have been made in monitoring the
observations. Still, the practitioner may need to solve the least-squares problem with the
current model and data, to indicate where improvements are needed in the weighting of
observations, modeling, or data collection.

On large-residual problems, the asymptotic convergence rate of Gauss-Newton and
Levenberg-Marquardt algorithms is only linear, which is slower than the superlinear convergence
rate attained by algorithms for general unconstrained problems, such as Newton or
quasi-Newton. If the individual Hessians $\nabla^2 r_j$ are easy to calculate, it may be better to ignore the
structure of the least-squares objective and apply Newton’s method with trust region or line
search to the problem of minimizing $f$. Quasi-Newton methods, which attain a superlinear
convergence rate without requiring calculation of $\nabla^2 r_j$, are another option. However,
the behavior of both Newton and quasi-Newton on early iterations (before reaching a
neighborhood of the solution) may be inferior to Gauss-Newton and Levenberg-Marquardt.

Of course, we often do not know beforehand whether a problem will turn out to
have small or large residuals at the solution. It seems reasonable, therefore, to consider
hybrid algorithms, which would behave like Gauss-Newton or Levenberg-Marquardt if the
residuals turn out to be small (and hence take advantage of the cost savings associated with
these methods) but switch to Newton or quasi-Newton steps if the residuals at the solution
appear to be large.

There are a couple of ways to construct hybrid algorithms. One approach, due to
Fletcher and Xu (see Fletcher [101]), maintains a sequence of positive definite Hessian
approximations $B_k$. If the Gauss-Newton step from $x_k$ reduces the function $f$ by a certain
fixed amount (say, a factor of 5), then this step is taken and $B_k$ is overwritten by $J_k^T J_k$.
Otherwise, a direction is computed using $B_k$, and the new point $x_{k+1}$ is obtained by
performing a line search. In either case, a BFGS-like update is applied to $B_k$ to obtain a new
approximation $B_{k+1}$. In the zero-residual case, the method eventually always takes
Gauss-Newton steps (giving quadratic convergence), while it eventually reduces to BFGS in the
nonzero-residual case (giving superlinear convergence). Numerical results in Fletcher [101,
Tables 6.1.2, 6.1.3] show good results for this approach on small-, large-, and zero-residual
problems.

A second way to combine Gauss-Newton and quasi-Newton ideas is to maintain
approximations to just the second-order part of the Hessian. That is, we maintain a sequence
of matrices $S_k$ that approximate the summation term $\sum_{j=1}^{m} r_j(x_k) \nabla^2 r_j(x_k)$ in (10.5), and
then use the overall Hessian approximation

$$ B_k = J_k^T J_k + S_k $$

in a trust-region or line search model for calculating the step $p_k$. Updates to $S_k$ are devised
so that the approximate Hessian $B_k$, or its constituent parts, mimics the behavior of the
corresponding exact quantities over the step just taken. The update formula is based on a
secant equation, which arises also in the context of unconstrained minimization (6.6) and
nonlinear equations (11.27). In the present instance, there are a number of different ways
to define the secant equation and to specify the other conditions needed for a complete
update formula for $S_k$. We describe the algorithm of Dennis, Gay, and Welsch [90], which
is probably the best-known algorithm in this class because of its implementation in the
well-known NL2SOL package.

In [90], the secant equation is motivated in the following way. Ideally, $S_{k+1}$ should be
a close approximation to the exact second-order term at $x = x_{k+1}$; that is,

$$ S_{k+1} \approx \sum_{j=1}^{m} r_j(x_{k+1}) \nabla^2 r_j(x_{k+1}). $$

Since we do not want to calculate the individual Hessians $\nabla^2 r_j$ in this formula, we could
replace each of them with an approximation $(B_j)_{k+1}$ and impose the condition that $(B_j)_{k+1}$
should mimic the behavior of its exact counterpart $\nabla^2 r_j$ over the step just taken; that
is,

$$ \begin{aligned} (B_j)_{k+1}(x_{k+1} - x_k) &= \nabla r_j(x_{k+1}) - \nabla r_j(x_k) \\ &= (\text{row } j \text{ of } J(x_{k+1}))^T - (\text{row } j \text{ of } J(x_k))^T. \end{aligned} $$

This condition leads to a secant equation on $S_{k+1}$, namely,

$$ \begin{aligned} S_{k+1}(x_{k+1} - x_k) &= \sum_{j=1}^{m} r_j(x_{k+1}) (B_j)_{k+1} (x_{k+1} - x_k) \\ &= \sum_{j=1}^{m} r_j(x_{k+1}) \left[ (\text{row } j \text{ of } J(x_{k+1}))^T - (\text{row } j \text{ of } J(x_k))^T \right] \\ &= J_{k+1}^T r_{k+1} - J_k^T r_{k+1}. \end{aligned} $$

As usual, this condition does not completely specify the new approximation $S_{k+1}$. Dennis,
Gay, and Welsch add requirements that $S_{k+1}$ be symmetric and that the difference $S_{k+1} - S_k$
from the previous estimate $S_k$ be minimized in a certain sense, and derive the following
update formula:

$$ S_{k+1} = S_k + \frac{(y^{\#} - S_k s) y^T + y (y^{\#} - S_k s)^T}{y^T s} - \frac{(y^{\#} - S_k s)^T s}{(y^T s)^2} \, y y^T, \tag{10.43} $$

where

$$ \begin{aligned} s &= x_{k+1} - x_k, \\ y &= J_{k+1}^T r_{k+1} - J_k^T r_k, \\ y^{\#} &= J_{k+1}^T r_{k+1} - J_k^T r_{k+1}. \end{aligned} $$

Note that (10.43) is a slight variant on the DFP update for unconstrained minimization. It
would be identical if $y^{\#}$ and $y$ were the same.
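
A direct transcription of the update (10.43) is sketched below, without the scaling and safeguards used in practice; the function name and the dense list-of-rows storage are assumptions of this example. A useful check is that the result is symmetric and satisfies the secant equation $S_{k+1} s = y^{\#}$, which follows by multiplying (10.43) by $s$.

```python
def dgw_update(S, s, y, y_sharp):
    """Apply the update (10.43) to the symmetric matrix S (list of rows).

    s, y, y_sharp are the vectors s, y, y# defined after (10.43).
    Requires y^T s != 0.
    """
    n = len(s)
    Ss = [sum(S[i][j] * s[j] for j in range(n)) for i in range(n)]
    v = [y_sharp[i] - Ss[i] for i in range(n)]       # v = y# - S s
    ys = sum(y[i] * s[i] for i in range(n))          # y^T s
    vs = sum(v[i] * s[i] for i in range(n))          # (y# - S s)^T s
    return [[S[i][j]
             + (v[i] * y[j] + y[i] * v[j]) / ys      # symmetric rank-two term
             - vs / ys ** 2 * y[i] * y[j]            # rank-one correction
             for j in range(n)] for i in range(n)]
```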

Dennis, Gay, and Welsch use their approximate Hessian $J_k^T J_k + S_k$ in conjunction
with a trust-region strategy, but a few more features are needed to enhance its performance.
One deficiency of its basic update strategy for $S_k$ is that this matrix is not guaranteed to
vanish as the iterates approach a zero-residual solution, so it can interfere with superlinear
convergence. This problem is avoided by scaling $S_k$ prior to its update; we replace $S_k$ by $\tau_k S_k$
on the right-hand-side of (10.43), where

$$ \tau_k = \min\left( 1, \frac{|s^T y^{\#}|}{|s^T S_k s|} \right). $$

A final modification in the overall algorithm is that the $S_k$ term is omitted from the Hessian
approximation when the resulting Gauss-Newton model produces a sufficiently good step.


10.4  ORTHOGONAL  DISTANCE  REGRESSION 


In Example 10.1 we assumed that no errors were made in noting the time at which the blood
samples were drawn, so that the differences between the model $\phi(x; t_j)$ and the observation
$y_j$ were due to inadequacy in the model or measurement errors in $y_j$. We assumed that
any errors in the ordinates (the times $t_j$) are tiny by comparison with the errors in the
observations. This assumption often is reasonable, but there are cases where the answer can
be seriously distorted if we fail to take possible errors in the ordinates into account. Models
that take these errors into account are known in the statistics literature as errors-in-variables
models [280, Chapter 10], and the resulting optimization problems are referred to as total
least squares in the case of a linear model (see Golub and Van Loan [136, Chapter 5]) or as
orthogonal distance regression in the nonlinear case (see Boggs, Byrd, and Schnabel [30]).

We formulate this problem mathematically by introducing perturbations $\delta_j$ for the
ordinates $t_j$, as well as perturbations $\epsilon_j$ for $y_j$, and seeking the values of these $2m$
perturbations that minimize the discrepancy between the model and the observations, as measured
by a weighted least-squares objective function. To be precise, we relate the quantities $t_j$, $y_j$,
$\delta_j$, and $\epsilon_j$ by

$$ y_j = \phi(x; t_j + \delta_j) + \epsilon_j, \quad j = 1, 2, \dots, m, \tag{10.44} $$

and define the minimization problem as

$$ \min_{x, \delta_j, \epsilon_j} \frac{1}{2} \sum_{j=1}^{m} \left( w_j^2 \epsilon_j^2 + d_j^2 \delta_j^2 \right), \quad \text{subject to (10.44)}. \tag{10.45} $$

The quantities $w_j$ and $d_j$ are weights, selected either by the modeler or by some automatic
estimate of the relative significance of the error terms.

It is easy to see how the term “orthogonal distance regression” originates when we
graph this problem; see Figure 10.2. If all the weights $w_j$ and $d_j$ are equal, then each term
in the summation (10.45) is simply the shortest distance between the point $(t_j, y_j)$ and the
curve $\phi(x; t)$ (plotted as a function of $t$). The shortest path between each point and the
curve is orthogonal to the curve at the point of intersection.

Using the constraints (10.44) to eliminate the variables $\epsilon_j$ from (10.45), we obtain the
unconstrained least-squares problem

$$ \min_{x, \delta} F(x, \delta) = \frac{1}{2} \sum_{j=1}^{m} \left( w_j^2 [y_j - \phi(x; t_j + \delta_j)]^2 + d_j^2 \delta_j^2 \right) = \frac{1}{2} \sum_{j=1}^{2m} r_j^2(x, \delta), \tag{10.46} $$

where $\delta = (\delta_1, \delta_2, \dots, \delta_m)^T$ and we have defined

$$ r_j(x, \delta) = \begin{cases} w_j [\phi(x; t_j + \delta_j) - y_j], & j = 1, 2, \dots, m, \\ d_{j-m} \delta_{j-m}, & j = m + 1, \dots, 2m. \end{cases} \tag{10.47} $$

Note that (10.46) is now a standard least-squares problem with $2m$ residuals and $m + n$
unknowns, which we can solve by using the techniques in this chapter. A naive
implementation of this strategy may, however, be quite expensive, since the number of parameters ($m + n$)
and the number of observations ($2m$) may both be much larger than for the original
problem.

Figure 10.2 Orthogonal distance regression minimizes the sum of squares of the
distance from each point to the curve.

Fortunately, the Jacobian matrix for (10.46) has a special structure that can be
exploited in implementing the Gauss-Newton or Levenberg-Marquardt methods. Many of its
components are zero; for instance, we have

$$ \frac{\partial r_j}{\partial \delta_i} = \frac{\partial \, w_j [\phi(x; t_j + \delta_j) - y_j]}{\partial \delta_i} = 0, \quad i, j = 1, 2, \dots, m, \; i \ne j, $$

and

$$ \frac{\partial r_j}{\partial x_i} = 0, \quad j = m + 1, \dots, 2m, \; i = 1, 2, \dots, n. $$

Additionally, we have for $j = 1, 2, \dots, m$ and $i = 1, 2, \dots, m$ that

$$ \frac{\partial r_{m+j}}{\partial \delta_i} = \begin{cases} d_j & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} $$

Hence, we can partition the Jacobian of the residual function $r$ defined by (10.47) into
blocks and write

$$ J(x, \delta) = \begin{bmatrix} \hat{J} & V \\ 0 & D \end{bmatrix}, \tag{10.48} $$

where $V$ and $D$ are $m \times m$ diagonal matrices and $\hat{J}$ is the $m \times n$ matrix of partial derivatives
of the functions $w_j \phi(x; t_j + \delta_j)$ with respect to $x$. Boggs, Byrd, and Schnabel [30] apply
the Levenberg-Marquardt algorithm to (10.46) and note that block elimination can be used
to solve the subproblems (10.33), (10.35) efficiently. Given the partitioning (10.48), we can
partition the step vector $p$ and the residual vector $r$ accordingly as

$$ p = \begin{bmatrix} p_x \\ p_\delta \end{bmatrix}, \qquad r = \begin{bmatrix} \hat{r}_1 \\ \hat{r}_2 \end{bmatrix}, $$

and write the normal equations (10.33) in the partitioned form

$$ \begin{bmatrix} \hat{J}^T \hat{J} + \lambda I & \hat{J}^T V \\ V \hat{J} & V^2 + D^2 + \lambda I \end{bmatrix} \begin{bmatrix} p_x \\ p_\delta \end{bmatrix} = - \begin{bmatrix} \hat{J}^T \hat{r}_1 \\ V \hat{r}_1 + D \hat{r}_2 \end{bmatrix}. \tag{10.49} $$

Since the lower right submatrix $V^2 + D^2 + \lambda I$ is diagonal, it is easy to eliminate $p_\delta$ from
this system and obtain a smaller $n \times n$ system to be solved for $p_x$ alone. The total cost
of finding a step is only marginally greater than for the $m \times n$ problem arising from the
standard least-squares model.
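
The block elimination can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation of Boggs, Byrd, and Schnabel: the diagonal matrices $V$ and $D$ are passed as lists `v` and `d`, the reduced $n \times n$ system is solved by a small dense Gaussian elimination, and all function names are hypothetical. Writing $E = V^2 + D^2 + \lambda I$, the second block row of (10.49) gives $p_\delta = -E^{-1}(V \hat{r}_1 + D \hat{r}_2 + V \hat{J} p_x)$, and substituting into the first block row gives a system in $p_x$ alone.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def odr_step(J_hat, v, d, r1, r2, lam):
    """Solve the partitioned normal equations (10.49) by eliminating p_delta.

    J_hat is m x n; v and d hold the diagonals of V and D; r1, r2 are the two
    residual blocks. Requires v_j^2 + d_j^2 + lam > 0 for every j.
    """
    m, n = len(J_hat), len(J_hat[0])
    e = [v[j] ** 2 + d[j] ** 2 + lam for j in range(m)]   # diagonal of V^2 + D^2 + lam I
    c = [v[j] * r1[j] + d[j] * r2[j] for j in range(m)]   # V r1 + D r2
    w = [1.0 - v[j] ** 2 / e[j] for j in range(m)]        # weights left after elimination
    q = [r1[j] - v[j] * c[j] / e[j] for j in range(m)]
    A = [[sum(w[j] * J_hat[j][a] * J_hat[j][b] for j in range(m))
          + (lam if a == b else 0.0) for b in range(n)] for a in range(n)]
    rhs = [-sum(J_hat[j][a] * q[j] for j in range(m)) for a in range(n)]
    p_x = solve(A, rhs)                                   # reduced n x n system
    Jp = [sum(J_hat[j][a] * p_x[a] for a in range(n)) for j in range(m)]
    p_delta = [-(c[j] + v[j] * Jp[j]) / e[j] for j in range(m)]
    return p_x, p_delta
```

Apart from the $n \times n$ solve, every operation is a loop over the $m$ observations, which is why the cost stays close to that of the standard $m \times n$ problem.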

NOTES  AND  REFERENCES 

Algorithms for linear least squares are discussed comprehensively by Björck [29],
who includes detailed error analyses of the different algorithms and software listings. He
considers not just the basic problem (10.13) but also the situation in which there are bounds
(for example, $x \ge 0$) or linear constraints (for example, $Ax \ge b$) on the variables. Golub
and Van Loan [136, Chapter 5] survey the state of the art, including discussion of the
suitability of the different approaches (for example, normal equations vs. QR factorization)
for different problem types. A classical reference on linear least-squares is Lawson and
Hanson [188].

Very large nonlinear least-squares problems arise in numerous areas of application,
such as medical imaging, geophysics, economics, and engineering design. In many instances,
both the number of variables $n$ and the number of residuals $m$ are large, but it is also quite
common that only $m$ is large.

The original description of the Levenberg-Marquardt algorithm [190, 203] did not
make the connection with the trust-region concept. Rather, it adjusted the value of $\lambda$ in
(10.33) directly, increasing or decreasing it by a certain factor according to whether or not
the previous trial step was effective in decreasing $f(\cdot)$. (The heuristics for adjusting $\lambda$ were
analogous to those used for adjusting the trust-region radius $\Delta_k$ in Algorithm 4.1.) Similar
convergence results to Theorem 10.3 can be proved for algorithms that use this approach
(see, for instance, Osborne [231]), independently of trust-region analysis. The connection
with trust regions was firmly established by Moré [210].
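
The direct adjustment of $\lambda$ described above can be sketched on a one-parameter fitting problem. This is a minimal illustration under stated assumptions: the model $y \approx e^{at}$, the data, the initial $\lambda = 1$, and the factor of 10 are hypothetical choices made for the example and are not taken from the original papers.

```python
import math


def lm_scalar(t, y, a, lam=1.0, factor=10.0, max_iter=100, tol=1e-12):
    """Fit y ~ exp(a t) by a Levenberg-Marquardt iteration that adjusts lam directly."""
    def residuals(a):
        return [math.exp(a * tj) - yj for tj, yj in zip(t, y)]

    def f(r):
        return 0.5 * sum(rj * rj for rj in r)

    r = residuals(a)
    for _ in range(max_iter):
        J = [tj * math.exp(a * tj) for tj in t]      # derivative of each residual w.r.t. a
        grad = sum(Jj * rj for Jj, rj in zip(J, r))  # J^T r
        if abs(grad) < tol:
            break
        # Scalar form of the damped normal equations (J^T J + lam) p = -J^T r.
        p = -grad / (sum(Jj * Jj for Jj in J) + lam)
        r_trial = residuals(a + p)
        if f(r_trial) < f(r):
            a, r, lam = a + p, r_trial, lam / factor  # effective step: relax the damping
        else:
            lam *= factor                             # ineffective step: damp more heavily
    return a


# Hypothetical noise-free data generated with a = 0.5.
t = [0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.5 * tj) for tj in t]
a_fit = lm_scalar(t, y, a=2.0)
```

A trust-region implementation would instead adjust a radius and derive $\lambda$ from it; the sketch only mirrors the older heuristic.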

Wright and Holt [318] present an inexact Levenberg-Marquardt approach for
large-scale nonlinear least squares that manipulates the parameter $\lambda$ directly rather than
making use of the connection to trust-region algorithms. This method takes steps $p_k$ that,
analogously to (7.2) and (7.3) in Chapter 7, satisfy the system

\[
\left\| (J_k^T J_k + \lambda_k I) p_k + J_k^T r_k \right\| \le \eta_k \| J_k^T r_k \|, \quad \text{for some } \eta_k \in [0, \eta],
\]

where $\eta \in (0, 1)$ is a constant and $\{\eta_k\}$ is a forcing sequence. A ratio of actual to predicted
decrease is used to decide whether the step $p_k$ should be taken, and convergence
to stationary points can be proved under certain assumptions. The method can be implemented
efficiently by using Algorithm LSQR of Paige and Saunders [234] to calculate the
approximate solution of (10.35) since, for a small marginal cost, this algorithm can compute
approximate solutions for a number of different values of $\lambda_k$ simultaneously. Hence, we can
compute values of $p_k$ corresponding to a range of values of $\lambda_k$, and choose the actual step to
be the one corresponding to the smallest $\lambda_k$ for which the actual-predicted decrease ratio is
satisfactory.

Nonlinear least-squares software is fairly prevalent because of the high demand
for it. Major numerical software libraries such as IMSL, HSL, NAG, and SAS, as well as
programming environments such as Mathematica and Matlab, contain robust nonlinear
least-squares implementations. Other high-quality implementations include DFNLP, MINPACK,
NL2SOL, and NLSSOL; see Moré and Wright [217, Chapter 3]. The nonlinear programming
packages LANCELOT, KNITRO, and SNOPT provide large-scale implementations of the
Gauss-Newton and Levenberg-Marquardt methods. The orthogonal distance regression algorithm
is implemented by ODRPACK [31].

All  these  routines  (which  can  be  accessed  through  the  web)  give  the  user  the  option 
of  either  supplying  Jacobians  explicitly  or  else  allowing  the  code  to  compute  them  by  finite 
differencing.  (In  the  latter  case,  the  user  need  only  write  code  to  compute  the  residual  vector 
r(x);  see  Chapter  8.)  Seber  and  Wild  [280,  Chapter  15]  describe  some  of  the  important 
practical  issues  in  selecting  software  for  statistical  applications. 

Exercises


10.1 Let $J$ be an $m \times n$ matrix with $m \ge n$, and let $y \in \mathbb{R}^m$ be a vector.

(a) Show that $J$ has full column rank if and only if $J^T J$ is nonsingular.

(b) Show that $J$ has full column rank if and only if $J^T J$ is positive definite.

10.2 Show that the function $f(x)$ in (10.13) is convex.

10.3 Show that

(a) if $Q$ is an orthogonal matrix, then $\|Qx\| = \|x\|$ for any vector $x$;

(b) the matrices $R$ in (10.15) and $R$ in (10.17) are identical if $\Pi = I$, provided that $J$ has
full column rank $n$.

10.4

(a) Show that $x^*$ defined in (10.22) is a minimizer of (10.13).

(b) Find $\|x^*\|$ and conclude that this norm is minimized when $\tau_i = 0$ for all $i$ with $\sigma_i = 0$.

10.5 Suppose that each residual function $r_j$ and its gradient are Lipschitz continuous
with Lipschitz constant $L$, that is,

\[
\| r_j(x) - r_j(\tilde x) \| \le L \| x - \tilde x \|, \qquad
\| \nabla r_j(x) - \nabla r_j(\tilde x) \| \le L \| x - \tilde x \|
\]

for all $j = 1, 2, \ldots, m$ and all $x, \tilde x \in \mathcal{D}$, where $\mathcal{D}$ is a compact subset of $\mathbb{R}^n$. Assume also
that the $r_j$ are bounded on $\mathcal{D}$, that is, there exists $M > 0$ such that $|r_j(x)| \le M$ for all
$j = 1, 2, \ldots, m$ and all $x \in \mathcal{D}$. Find Lipschitz constants for the Jacobian $J$ (10.3) and the
gradient $\nabla f$ (10.4) over $\mathcal{D}$.

10.6 Express the solution $p$ of (10.33) in terms of the singular-value decomposition
of $J(x)$ and the scalar $\lambda$. Express its squared norm $\|p\|^2$ in these same terms, and show that

\[
\lim_{\lambda \to 0} p = - \sum_{\sigma_i \ne 0} \frac{u_i^T r}{\sigma_i} v_i.
\]


Chapter 11


Nonlinear Equations

In  many  applications  we  do  not  need  to  optimize  an  objective  function  explicitly,  but  rather 
to  find  values  of  the  variables  in  a  model  that  satisfy  a  number  of  given  relationships.  When 
these  relationships  take  the  form  of  n  equalities — the  same  number  of  equality  conditions 
as  variables  in  the  model — the  problem  is  one  of  solving  a  system  of  nonlinear  equations. 
We  write  this  problem  mathematically  as 


\[
r(x) = 0, \tag{11.1}
\]

where $r : \mathbb{R}^n \to \mathbb{R}^n$ is a vector function, that is,

\[
r(x) = \begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_n(x) \end{bmatrix}.
\]

In this chapter, we assume that each function $r_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, 2, \ldots, n$, is smooth. A
vector $x^*$ for which (11.1) is satisfied is called a solution or root of the nonlinear equations.
A simple example is the system

\[
r(x) = \begin{bmatrix} x_2^2 - 1 \\ \sin x_1 - x_2 \end{bmatrix} = 0,
\]

which is a system of $n = 2$ equations with infinitely many solutions, two of which are
$x^* = (3\pi/2, -1)^T$ and $x^* = (\pi/2, 1)^T$. In general, the system (11.1) may have no solutions,
a unique solution, or many solutions.

The  techniques  for  solving  nonlinear  equations  overlap  in  their  motivation,  analysis, 
and  implementation  with  optimization  techniques  discussed  in  earlier  chapters.  In  both 
optimization  and  nonlinear  equations,  Newton’s  method  lies  at  the  heart  of  many  important 
algorithms.  Features  such  as  line  searches,  trust  regions,  and  inexact  solution  of  the  linear 
algebra  subproblems  at  each  iteration  are  important  in  both  areas,  as  are  other  issues  such 
as  derivative  evaluation  and  global  convergence. 

Because  some  important  algorithms  for  nonlinear  equations  proceed  by  minimizing 
a  sum  of  squares  of  the  equations,  that  is, 


\[
\min_x \sum_{i=1}^{n} r_i^2(x),
\]

there  are  particularly  close  connections  with  the  nonlinear  least-squares  problem  discussed 
in  Chapter  10.  The  differences  are  that  in  nonlinear  equations,  the  number  of  equations 
equals  the  number  of  variables  (instead  of  exceeding  the  number  of  variables,  as  is  typically 
the  case  in  Chapter  10),  and  that  we  expect  all  equations  to  be  satisfied  at  the  solution,  rather 
than  just  minimizing  the  sum  of  squares.  This  point  is  important  because  the  nonlinear 
equations  may  represent  physical  or  economic  constraints  such  as  conservation  laws  or 
consistency  principles,  which  must  hold  exactly  in  order  for  the  solution  to  be  meaningful. 

Many  applications  require  us  to  solve  a  sequence  of  closely  related  nonlinear  systems, 
as  in  the  following  example. 

Example 11.1 (Rheinboldt; see [212])

An  interesting  problem  in  control  is  to  analyze  the  stability  of  an  aircraft  in  response 
to  the  commands  of  the  pilot.  The  following  is  a  simplified  model  based  on  force-balance 
equations,  in  which  gravity  terms  have  been  neglected. 

The equilibrium equations for a particular aircraft are given by a system of 5 equations
in 8 unknowns of the form

\[
F(x) \equiv Ax + \phi(x) = 0, \tag{11.2}
\]

where $F : \mathbb{R}^8 \to \mathbb{R}^5$, the matrix $A$ is given by

\[
A = \begin{bmatrix}
-3.933 & 0.107 & 0.126 & 0 & -9.99 & 0 & -45.83 & -7.64 \\
0 & -0.987 & 0 & -22.95 & 0 & -28.37 & 0 & 0 \\
0.002 & 0 & -0.235 & 0 & 5.67 & 0 & -0.921 & -6.51 \\
0 & 1.0 & 0 & -1.0 & 0 & -0.168 & 0 & 0 \\
0 & 0 & -1.0 & 0 & -0.196 & 0 & -0.0071 & 0
\end{bmatrix},
\]

and the nonlinear part is defined by

\[
\phi(x) = \begin{bmatrix}
-0.727 x_2 x_3 + 8.39 x_3 x_4 - 684.4 x_4 x_5 + 63.5 x_4 x_2 \\
0.949 x_1 x_3 + 0.173 x_1 x_5 \\
-0.716 x_1 x_2 - 1.578 x_1 x_4 + 1.132 x_4 x_2 \\
-x_1 x_5 \\
x_1 x_4
\end{bmatrix}.
\]

The first three variables $x_1$, $x_2$, $x_3$ represent the rates of roll, pitch, and yaw, respectively,
while $x_4$ is the incremental angle of attack and $x_5$ the sideslip angle. The last three
variables $x_6$, $x_7$, $x_8$ are the controls; they represent the deflections of the elevator, aileron,
and rudder, respectively.

For a given choice of the control variables $x_6$, $x_7$, $x_8$ we obtain a system of 5 equations
and 5 unknowns. If we wish to study the behavior of the aircraft as the controls are changed,
we need to solve a system of nonlinear equations with unknowns $x_1, x_2, \ldots, x_5$ for each
setting of the controls.
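
As a sketch of how such a family of systems might be set up in code, the function below evaluates $F(x) = Ax + \phi(x)$ and returns, for a fixed control setting, a residual function of the five state variables alone. The coefficients are transcribed from the example as read here; the names `phi`, `F`, and `residual_for_controls` are hypothetical helpers introduced only for illustration.

```python
A = [
    [-3.933, 0.107, 0.126, 0.0, -9.99, 0.0, -45.83, -7.64],
    [0.0, -0.987, 0.0, -22.95, 0.0, -28.37, 0.0, 0.0],
    [0.002, 0.0, -0.235, 0.0, 5.67, 0.0, -0.921, -6.51],
    [0.0, 1.0, 0.0, -1.0, 0.0, -0.168, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0, -0.196, 0.0, -0.0071, 0.0],
]


def phi(x):
    """Nonlinear part of the model; it depends only on the state variables x1..x5."""
    x1, x2, x3, x4, x5 = x[:5]
    return [
        -0.727 * x2 * x3 + 8.39 * x3 * x4 - 684.4 * x4 * x5 + 63.5 * x4 * x2,
        0.949 * x1 * x3 + 0.173 * x1 * x5,
        -0.716 * x1 * x2 - 1.578 * x1 * x4 + 1.132 * x4 * x2,
        -x1 * x5,
        x1 * x4,
    ]


def F(x):
    """Evaluate F(x) = A x + phi(x) for a vector x of length 8."""
    return [sum(a * xi for a, xi in zip(row, x)) + p for row, p in zip(A, phi(x))]


def residual_for_controls(controls):
    """Fix the three controls (x6, x7, x8) and return a map from R^5 to R^5."""
    def residual(state):
        return F(list(state) + list(controls))
    return residual


# With all controls at zero, the zero state satisfies the equations trivially.
r_zero_controls = residual_for_controls([0.0, 0.0, 0.0])
rest_state_residual = r_zero_controls([0.0] * 5)
```

Sweeping `controls` over a grid and solving each resulting 5-by-5 system, starting from the solution for the neighboring setting, is one natural way to organize the sequence of closely related problems.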


Despite the many similarities between nonlinear equations and unconstrained and
least-squares optimization algorithms, there are also some important differences. To obtain
quadratic convergence in optimization we require second derivatives of the objective
function, whereas knowledge of the first derivatives is sufficient in nonlinear equations.
Quasi-Newton methods are perhaps less useful in nonlinear equations than in optimization.
In unconstrained optimization, the objective function is the natural choice of merit
function that gauges progress towards the solution, but in nonlinear equations various merit
functions can be used, all of which have some drawbacks. Line search and trust-region
techniques play an equally important role in optimization, but one can argue that trust-region
algorithms have certain theoretical advantages in solving nonlinear equations.

Figure 11.1 The function $r(x) = \sin(5x) - x$ has three roots.

Some of the difficulties that arise in trying to solve nonlinear equations can be
illustrated by a simple scalar example ($n = 1$). Suppose we have

\[
r(x) = \sin(5x) - x, \tag{11.3}
\]

as plotted in Figure 11.1. From this figure we see that there are three solutions of the
problem $r(x) = 0$, also known as roots of $r$, located at zero and approximately $\pm 0.519148$.
This situation of multiple solutions is similar to optimization problems where, for example,
a function may have more than one local minimum. It is not quite the same, however: In
the case of optimization, one of the local minima may have a lower function value than
the others (making it a "better" solution), while in nonlinear equations all solutions are
equally good from a mathematical viewpoint. (If the modeler decides that the solution
found by the algorithm makes no sense on physical grounds, their model may need to be
reformulated.)
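
A short experiment illustrates how the root that is found depends on the starting point. The sketch below applies the scalar Newton iteration $x \leftarrow x - r(x)/r'(x)$ to (11.3); the starting points, tolerance, and iteration limit are arbitrary choices made for the illustration.

```python
import math


def r(x):
    return math.sin(5.0 * x) - x


def dr(x):
    return 5.0 * math.cos(5.0 * x) - 1.0


def newton_scalar(x, max_iter=50, tol=1e-14):
    """Scalar Newton iteration; stops when the step becomes negligible."""
    for _ in range(max_iter):
        step = r(x) / dr(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Three nearby starting points lead to three different roots.
roots = {x0: newton_scalar(x0) for x0 in (0.1, 0.6, -0.6)}
```

Nothing in the iteration itself prefers one root over another; which one is reached is determined by the starting point alone.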

In this chapter we start by outlining algorithms related to Newton's method and
examining their local convergence properties. Besides Newton's method itself, these include
Broyden's quasi-Newton method, inexact Newton methods, and tensor methods.
We then address global convergence, which is the issue of trying to force convergence to
a solution from a remote starting point. Finally, we discuss a class of methods in which
an "easy" problem (one to which the solution is well known) is gradually transformed
into the problem $F(x) = 0$. In these so-called continuation (or homotopy) methods, we
track the solution as the problem changes, with the aim of finishing up at a solution of
$F(x) = 0$.

Throughout this chapter we make the assumption that the vector function $r$ is continuously
differentiable in the region $\mathcal{D}$ containing the values of $x$ we are interested in. In
other words, the Jacobian $J(x)$ (the matrix of first partial derivatives of $r(x)$ defined in the
Appendix and in (10.3)) exists and is continuous. We say that $x^*$ satisfying $r(x^*) = 0$ is a
degenerate solution if $J(x^*)$ is singular, and a nondegenerate solution otherwise.


11.1 LOCAL ALGORITHMS

NEWTON’S  METHOD  FOR  NONLINEAR  EQUATIONS 

Recall from Theorem 2.1 that Newton's method for minimizing $f : \mathbb{R}^n \to \mathbb{R}$ forms a
quadratic model function by taking the first three terms of the Taylor series approximation
of $f$ around the current iterate $x_k$. The Newton step is the vector that minimizes this model.
In the case of nonlinear equations, Newton's method is derived in a similar way, but with a
linear model, one that involves function values and first derivatives of the functions $r_i(x)$,
$i = 1, 2, \ldots, n$, at the current iterate $x_k$. We justify this strategy by referring to the following
multidimensional variant of Taylor's theorem.

Theorem 11.1.

Suppose that $r : \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable in some convex open set $\mathcal{D}$ and
that $x$ and $x + p$ are vectors in $\mathcal{D}$. We then have that

\[
r(x + p) = r(x) + \int_0^1 J(x + tp) \, p \, dt. \tag{11.4}
\]

We can define a linear model $M_k(p)$ of $r(x_k + p)$ by approximating the second term on the
right-hand side of (11.4) by $J(x)p$, and writing

\[
M_k(p) \stackrel{\text{def}}{=} r(x_k) + J(x_k) p. \tag{11.5}
\]

Newton's method, in its pure form, chooses the step $p_k$ to be the vector for which $M_k(p_k) = 0$,
that is, $p_k = -J(x_k)^{-1} r(x_k)$. We define it formally as follows.

Algorithm 11.1 (Newton's Method for Nonlinear Equations).

Choose $x_0$;

for $k = 0, 1, 2, \ldots$

    Calculate a solution $p_k$ to the Newton equations

    \[
    J(x_k) p_k = -r(x_k); \tag{11.6}
    \]

    $x_{k+1} \leftarrow x_k + p_k$;

end (for)
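
Algorithm 11.1 can be sketched directly for a system of two equations. The test system below, $r(x) = (x_1^2 + x_2^2 - 4,\; x_1 x_2 - 1)^T$, is a hypothetical example chosen for the illustration because its Jacobian is nonsingular at the root; the stopping tolerance and the use of Cramer's rule for the linear solve are likewise assumptions of the sketch.

```python
def r(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]


def jac(x):
    """Jacobian of r: row i holds the partial derivatives of r_i."""
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]


def solve_2x2(a, b):
    """Solve the 2x2 linear system a p = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def newton(r, jac, x, max_iter=20, tol=1e-12):
    """Pure Newton iteration: solve J(x_k) p_k = -r(x_k), then x_{k+1} = x_k + p_k."""
    for _ in range(max_iter):
        rx = r(x)
        if max(abs(v) for v in rx) < tol:
            break
        p = solve_2x2(jac(x), [-v for v in rx])
        x = [xi + pi for xi, pi in zip(x, p)]
    return x


x_star = newton(r, jac, [2.0, 0.5])
```

For larger $n$ the linear solve would be done by a matrix factorization rather than Cramer's rule; the structure of the loop is unchanged.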

We use a linear model to derive the Newton step, rather than a quadratic model as in
unconstrained optimization, because the linear model normally has a solution and yields an
algorithm with rapid convergence properties. In fact, Newton's method for unconstrained
optimization (see (2.15)) can be derived by applying Algorithm 11.1 to the nonlinear
equations $\nabla f(x) = 0$. We see also in Chapter 18 that sequential quadratic programming
for equality-constrained optimization can be derived by applying Algorithm 11.1 to the
nonlinear equations formed by the first-order optimality conditions (18.3) for this problem.
Another connection is with the Gauss-Newton method for nonlinear least squares; the
formula (11.6) is equivalent to (10.23) in the usual case in which $J(x_k)$ is nonsingular.

When the iterate $x_k$ is close to a nondegenerate root $x^*$, Newton's method converges
superlinearly, as we show in Theorem 11.2 below. Potential shortcomings of the method
include the following.

•  When the starting point is remote from a solution, Algorithm 11.1 can behave
erratically. When $J(x_k)$ is singular, the Newton step may not even be defined.

•  First-derivative information (the Jacobian matrix $J$) may be difficult to obtain.

•  It may be too expensive to find and calculate the Newton step $p_k$ exactly when $n$ is
large.

•  The root $x^*$ in question may be degenerate, that is, $J(x^*)$ may be singular.

An example of a degenerate problem is the scalar function $r(x) = x^2$, which has a single
degenerate root at $x^* = 0$. Algorithm 11.1, when started from any nonzero $x_0$, generates the
sequence of iterates

\[
x_k = \frac{1}{2^k} x_0,
\]

which converges to the solution 0, but only at a linear rate.
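
The halving of the error at each step is easy to confirm numerically. The sketch below applies the Newton update to $r(x) = x^2$, for which $r'(x) = 2x$; the starting point $x_0 = 1$ and the number of steps are arbitrary choices for the illustration.

```python
def newton_iterates(x0, n_steps):
    """Newton iterates for r(x) = x^2, using r'(x) = 2x."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x = x - (x * x) / (2.0 * x)   # x_{k+1} = x_k - r(x_k) / r'(x_k) = x_k / 2
        xs.append(x)
    return xs


xs = newton_iterates(1.0, 10)
# Ratio of successive errors |x_{k+1} - 0| / |x_k - 0|.
ratios = [xs[k + 1] / xs[k] for k in range(10)]
```

Every ratio equals 1/2, the signature of Q-linear convergence; at a nondegenerate root the same ratios would tend to zero.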

As  we  show  later  in  this  chapter,  Newton’s  method  can  be  modified  and  enhanced  in 
various  ways  to  get  around  most  of  these  problems.  The  variants  we  describe  form  the  basis 
of  much  of  the  available  software  for  solving  nonlinear  equations. 

We summarize the local convergence properties of Algorithm 11.1 in the following
theorem. For part of this result, we make use of a Lipschitz continuity assumption on the
Jacobian, by which we mean that there is a constant $\beta_L$ such that

\[
\| J(x_0) - J(x_1) \| \le \beta_L \| x_0 - x_1 \|, \tag{11.7}
\]

for all $x_0$ and $x_1$ in the domain in question.

Theorem 11.2.

Suppose that $r$ is continuously differentiable in a convex open set $\mathcal{D} \subset \mathbb{R}^n$. Let $x^* \in \mathcal{D}$
be a nondegenerate solution of $r(x) = 0$, and let $\{x_k\}$ be the sequence of iterates generated by
Algorithm 11.1. Then when $x_k \in \mathcal{D}$ is sufficiently close to $x^*$, we have

\[
x_{k+1} - x^* = o(\| x_k - x^* \|), \tag{11.8}
\]

indicating local Q-superlinear convergence. When $r$ is Lipschitz continuously differentiable
near $x^*$, we have for all $x_k$ sufficiently close to $x^*$ that

\[
x_{k+1} - x^* = O(\| x_k - x^* \|^2), \tag{11.9}
\]

indicating local Q-quadratic convergence.

PROOF. Since $r(x^*) = 0$, we have from Theorem 11.1 that

\[
r(x_k) = r(x_k) - r(x^*) = J(x_k)(x_k - x^*) + w(x_k, x^*), \tag{11.10}
\]

where

\[
w(x_k, x^*) = \int_0^1 \left[ J(x_k + t(x^* - x_k)) - J(x_k) \right] (x_k - x^*) \, dt. \tag{11.11}
\]

From (A.12) and continuity of $J$, we have

\[
\begin{aligned}
\| w(x_k, x^*) \| &= \left\| \int_0^1 \left[ J(x_k + t(x^* - x_k)) - J(x_k) \right] (x_k - x^*) \, dt \right\| \\
&\le \int_0^1 \| J(x_k + t(x^* - x_k)) - J(x_k) \| \, \| x_k - x^* \| \, dt \\
&= o(\| x_k - x^* \|).
\end{aligned} \tag{11.12}
\]

Since $J(x^*)$ is nonsingular, there is a radius $\delta > 0$ and a positive constant $\beta^*$ such that for
all $x$ in the ball $B(x^*, \delta)$ defined by

\[
B(x^*, \delta) = \{ x \mid \| x - x^* \| \le \delta \}, \tag{11.13}
\]

we have that

\[
\| J(x)^{-1} \| \le \beta^* \quad \text{and} \quad x \in \mathcal{D}. \tag{11.14}
\]

Assuming that $x_k \in B(x^*, \delta)$, and recalling the definition (11.6), we multiply both sides of
(11.10) by $J(x_k)^{-1}$ to obtain

\[
\begin{aligned}
& -p_k = (x_k - x^*) + \| J(x_k)^{-1} \| \, o(\| x_k - x^* \|), \\
\Rightarrow \quad & x_k + p_k - x^* = o(\| x_k - x^* \|), \\
\Rightarrow \quad & x_{k+1} - x^* = o(\| x_k - x^* \|),
\end{aligned} \tag{11.15}
\]

which yields (11.8).

When the Lipschitz continuity assumption (11.7) is satisfied, we can obtain a sharper
estimate for the remainder term $w(x_k, x^*)$ defined in (11.11). By using (11.7) in (11.12), we
obtain

\[
\| w(x_k, x^*) \| = O(\| x_k - x^* \|^2). \tag{11.16}
\]

By multiplying (11.10) by $J(x_k)^{-1}$ as above, we obtain

\[
-p_k = (x_k - x^*) + J(x_k)^{-1} w(x_k, x^*),
\]

so the estimate (11.9) follows as in (11.15). □

INEXACT  NEWTON  METHODS 

Instead of solving (11.6) exactly, inexact Newton methods use search directions $p_k$
that satisfy the condition

\[
\| r_k + J_k p_k \| \le \eta_k \| r_k \|, \quad \text{for some } \eta_k \in [0, \eta], \tag{11.17}
\]

where $\eta \in [0, 1)$ is a constant. As in Chapter 7, we refer to $\{\eta_k\}$ as the forcing sequence.
Different methods make different choices of the forcing sequence, and they use different
algorithms for finding the approximate solutions $p_k$. The general framework for this class
of methods can be stated as follows.

Framework 11.2 (Inexact Newton for Nonlinear Equations).

Given $\eta \in [0, 1)$;

Choose $x_0$;

for $k = 0, 1, 2, \ldots$

    Choose forcing parameter $\eta_k \in [0, \eta]$;

    Find a vector $p_k$ that satisfies (11.17);

    $x_{k+1} \leftarrow x_k + p_k$;

end (for)
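
The framework can be sketched as follows. Several pieces are assumptions made only to keep the example self-contained: the test system is hypothetical, the forcing choice is $\eta_k = \min(\eta, \|r_k\|)$, and the inner solver is plain steepest descent on $\tfrac12 \|r_k + J_k p\|^2$, stopped as soon as the inexactness condition holds. Practical codes would use a Krylov method for the inner solve.

```python
import math


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def mat_t_vec(M, v):
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def inexact_step(J, r, eta_k, max_inner=10000):
    """Return p with ||r + J p|| <= eta_k ||r||, found by steepest descent."""
    p = [0.0] * len(r)
    res = list(r)                                  # res always equals r + J p
    for _ in range(max_inner):
        if norm(res) <= eta_k * norm(r):
            break
        g = mat_t_vec(J, res)                      # gradient of 0.5 * ||r + J p||^2
        Jg = matvec(J, g)
        alpha = sum(gi * gi for gi in g) / sum(v * v for v in Jg)  # exact line search
        p = [pi - alpha * gi for pi, gi in zip(p, g)]
        res = [ri - alpha * v for ri, v in zip(res, Jg)]
    return p


def inexact_newton(r, jac, x, eta=0.5, max_iter=50, tol=1e-10):
    for _ in range(max_iter):
        rx = r(x)
        if norm(rx) < tol:
            break
        eta_k = min(eta, norm(rx))                 # assumed forcing sequence
        p = inexact_step(jac(x), rx, eta_k)
        x = [xi + pi for xi, pi in zip(x, p)]
    return x


def r(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]


def jac(x):
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]


x_inexact = inexact_newton(r, jac, [2.0, 0.5])
```

Only the test inside `inexact_step` ties the inner solver to the framework, so a different linear solver can be substituted without touching the outer loop.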




The convergence theory for these methods depends only on the condition (11.17)
and not on the particular technique used to calculate $p_k$. The most important methods
in this class, however, make use of iterative techniques for solving linear systems of the
form $Jp = -r$, such as GMRES (Saad and Schultz [273], Walker [302]) or other Krylov-space
methods. Like the conjugate-gradient algorithm of Chapter 5 (which is not directly
applicable here, since the coefficient matrix $J$ is not symmetric positive definite), these
methods typically require us to perform a matrix-vector multiplication of the form $Jd$ for
some $d$ at each iteration, and to store a number of work vectors of length $n$. GMRES requires
an additional vector to be stored at each iteration, so it must be restarted periodically (often
every 10 or 20 iterations) to keep memory requirements at a reasonable level.

The matrix-vector products $Jd$ can be computed without explicit knowledge of the
Jacobian $J$. A finite-difference approximation to $Jd$ that requires one evaluation of $r(\cdot)$
is given by the formula (8.11). Calculation of $Jd$ exactly (at least, to within the limits of
finite-precision arithmetic) can be performed by using the forward mode of automatic
differentiation, at a cost of at most a small multiple of an evaluation of $r(\cdot)$. Details of this
procedure are given in Section 8.2.
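
A Jacobian-free product of this kind might look as follows. The sketch assumes the standard forward-difference form $Jd \approx (r(x + \epsilon d) - r(x))/\epsilon$, which is taken here to be what (8.11) expresses; the step $\epsilon = 10^{-7}$ and the test function are hypothetical choices. When $r(x)$ is already available from the outer iteration, only the shifted evaluation is new work.

```python
def jac_vec_fd(r, x, d, eps=1e-7):
    """Approximate J(x) d by a forward difference of r along the direction d."""
    rx = r(x)                                            # usually already known
    r_shift = r([xi + eps * di for xi, di in zip(x, d)])
    return [(a - b) / eps for a, b in zip(r_shift, rx)]


def r(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]


# At x = (2, 0.5) the exact Jacobian is [[4, 1], [0.5, 2]], so J d = (3, -1.5)
# for d = (1, -1).
Jd = jac_vec_fd(r, [2.0, 0.5], [1.0, -1.0])
```

The approximation error is of order $\epsilon$ plus a rounding contribution of order (machine precision)/$\epsilon$, so the result agrees with the exact product only to a limited number of digits.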

We do not discuss the iterative methods for sparse linear systems here, but refer
the interested reader to Kelley [177] and Saad [272] for comprehensive descriptions and
implementations of the most interesting techniques. We prove a local convergence theorem
for the method, similar to Theorem 11.2.

Theorem 11.3.

Suppose that $r$ is continuously differentiable in a convex open set $\mathcal{D} \subset \mathbb{R}^n$. Let $x^* \in \mathcal{D}$
be a nondegenerate solution of $r(x) = 0$, and let $\{x_k\}$ be the sequence of iterates generated by
Framework 11.2. Then when $x_k \in \mathcal{D}$ is sufficiently close to $x^*$, the following are true:

(i) If $\eta$ in (11.17) is sufficiently small, the convergence of $\{x_k\}$ to $x^*$ is Q-linear.

(ii) If $\eta_k \to 0$, the convergence is Q-superlinear.

(iii) If, in addition, $J(\cdot)$ is Lipschitz continuous in a neighborhood of $x^*$ and $\eta_k = O(\|r_k\|)$,
the convergence is Q-quadratic.

PROOF. We first rewrite (11.17) as

\[
J(x_k) p_k + r(x_k) = v_k, \quad \text{where } \| v_k \| \le \eta_k \| r(x_k) \|. \tag{11.18}
\]

Since $x^*$ is a nondegenerate root, we have as in (11.14) that there is a radius $\delta > 0$ such that
$\| J(x)^{-1} \| \le \beta^*$ for some constant $\beta^*$ and all $x \in B(x^*, \delta)$. By multiplying both sides of
(11.18) by $J(x_k)^{-1}$ and rearranging, we find that

\[
\| p_k + J(x_k)^{-1} r(x_k) \| = \| J(x_k)^{-1} v_k \| \le \beta^* \eta_k \| r(x_k) \|. \tag{11.19}
\]

As in (11.10), we have that

\[
r(x) = J(x)(x - x^*) + w(x, x^*), \tag{11.20}
\]

where $\rho(x) \stackrel{\text{def}}{=} \| w(x, x^*) \| / \| x - x^* \| \to 0$ as $x \to x^*$. By reducing $\delta$ if necessary, we have
from this expression that the following bound holds for all $x \in B(x^*, \delta)$:

\[
\| r(x) \| \le 2 \| J(x^*) \| \, \| x - x^* \| + o(\| x - x^* \|) \le 4 \| J(x^*) \| \, \| x - x^* \|. \tag{11.21}
\]

We now set $x = x_k$ in (11.20), and use (11.19) and (11.21) to obtain

\[
\begin{aligned}
\| x_k + p_k - x^* \| &= \| p_k + J(x_k)^{-1} ( r(x_k) - w(x_k, x^*) ) \| \\
&\le \beta^* \eta_k \| r(x_k) \| + \| J(x_k)^{-1} \| \, \| w(x_k, x^*) \| \\
&\le \left[ 4 \| J(x^*) \| \beta^* \eta_k + \beta^* \rho(x_k) \right] \| x_k - x^* \|.
\end{aligned} \tag{11.22}
\]

By choosing $x_k$ close enough to $x^*$ that $\rho(x_k) \le 1/(4\beta^*)$, and choosing $\eta = 1/(8 \| J(x^*) \| \beta^*)$,
we have that the first term in square brackets in (11.22) is at most 1/2 and the second is at
most 1/4, so the bracketed term is at most 3/4. Hence,
since $x_{k+1} = x_k + p_k$, this formula indicates Q-linear convergence of $\{x_k\}$ to $x^*$, proving
part (i).

Part (ii) follows immediately from the fact that the term in brackets in (11.22) goes to
zero as $x_k \to x^*$ and $\eta_k \to 0$. For part (iii), we combine the techniques above with the logic
of the second part of the proof of Theorem 11.2. Details are left as an exercise. □


BROYDEN’S  METHOD 

Secant methods, also known as quasi-Newton methods, do not require calculation of
the Jacobian $J(x)$. Instead, they construct their own approximation to this matrix, updating
it at each iteration so that it mimics the behavior of the true Jacobian $J$ over the step just
taken. The approximate Jacobian, which we denote at iteration $k$ by $B_k$, is then used to
construct a linear model analogous to (11.5), namely

\[
M_k(p) = r(x_k) + B_k p. \tag{11.23}
\]

We obtain the step by setting this model to zero. When $B_k$ is nonsingular, we have the
following explicit formula (cf. (11.6)):

\[
p_k = -B_k^{-1} r(x_k). \tag{11.24}
\]

The requirement that the approximate Jacobian should mimic the behavior of the
true Jacobian can be specified as follows. Let $s_k$ denote the step from $x_k$ to $x_{k+1}$, and let $y_k$
be the corresponding change in $r$, that is,

\[
s_k = x_{k+1} - x_k, \qquad y_k = r(x_{k+1}) - r(x_k). \tag{11.25}
\]

From Theorem 11.1, we have that $s_k$ and $y_k$ are related by the expression

\[
y_k = \int_0^1 J(x_k + t s_k) s_k \, dt \approx J(x_{k+1}) s_k + o(\| s_k \|). \tag{11.26}
\]

We require the updated Jacobian approximation $B_{k+1}$ to satisfy the following equation,
which is known as the secant equation,

\[
y_k = B_{k+1} s_k, \tag{11.27}
\]

which ensures that $B_{k+1}$ and $J(x_{k+1})$ have similar behavior along the direction $s_k$. (Note
the similarity with the secant equation (6.6) in quasi-Newton methods for unconstrained
optimization; the motivation is the same in both cases.) The secant equation does not say
anything about how $B_{k+1}$ should behave along directions orthogonal to $s_k$. In fact, we can
view (11.27) as a system of $n$ linear equations in $n^2$ unknowns, where the unknowns are
the components of $B_{k+1}$, so for $n > 1$ the equation (11.27) does not determine all the
components of $B_{k+1}$ uniquely. (The scalar case of $n = 1$ gives rise to the scalar secant
method; see (A.60).)

The most successful practical algorithm is Broyden's method, for which the update
formula is

\[
B_{k+1} = B_k + \frac{(y_k - B_k s_k) s_k^T}{s_k^T s_k}. \tag{11.28}
\]

The Broyden update makes the smallest possible change to the Jacobian (as measured by the
Euclidean norm $\| B_k - B_{k+1} \|_2$) that is consistent with (11.27), as we show in the following
lemma.

Lemma 11.4 (Dennis and Schnabel [92, Lemma 8.1.1]).

Among all matrices $B$ satisfying $B s_k = y_k$, the matrix $B_{k+1}$ defined by (11.28) minimizes
the difference $\| B - B_k \|$.


PROOF. Let $B$ be any matrix that satisfies $B s_k = y_k$. By the properties of the Euclidean
norm (see (A.10)) and the fact that $\| s s^T / s^T s \| = 1$ for any vector $s$ (see Exercise 11.1), we
have

\[
\| B_{k+1} - B_k \|
= \left\| \frac{(y_k - B_k s_k) s_k^T}{s_k^T s_k} \right\|
= \left\| \frac{(B - B_k) s_k s_k^T}{s_k^T s_k} \right\|
\le \| B - B_k \| \left\| \frac{s_k s_k^T}{s_k^T s_k} \right\|
= \| B - B_k \|.
\]

Hence, we have that

\[
B_{k+1} \in \arg\min_{B \,:\, y_k = B s_k} \| B - B_k \|,
\]

and the result is proved. □
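
The two properties just established can be checked numerically on a small instance. In the sketch below the matrix `B`, the step `s`, and the change `y` are hypothetical data, and the change in the matrix is measured in the Frobenius norm rather than the Euclidean norm of the lemma, purely because it is simple to compute without a linear algebra library. A competing matrix `B_alt` that also satisfies the secant equation is built by adding a rank-one term $u v^T$ with $v$ orthogonal to $s$.

```python
import math


def broyden_update(B, s, y):
    """Rank-one Broyden update B + (y - B s) s^T / (s^T s)."""
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    sts = sum(sj * sj for sj in s)
    return [[B[i][j] + (y[i] - Bs[i]) * s[j] / sts for j in range(n)]
            for i in range(n)]


def frob_dist(M, N):
    """Frobenius norm of M - N."""
    return math.sqrt(sum((a - b) ** 2
                         for row_m, row_n in zip(M, N)
                         for a, b in zip(row_m, row_n)))


def apply(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


B = [[2.0, 0.0], [1.0, 3.0]]
s = [1.0, 2.0]
y = [3.0, -1.0]
B_new = broyden_update(B, s, y)

# Another matrix satisfying the secant equation: v is orthogonal to s, so
# (B_new + u v^T) s = B_new s = y.
u, v = [0.5, -0.25], [2.0, -1.0]
B_alt = [[B_new[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
```

Both matrices map `s` to `y`, but `B_new` lies closer to `B`, in line with the minimal-change property.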

In the specification of the algorithm below, we allow a line search to be performed
along the search direction $p_k$, so that $s_k = \alpha p_k$ for some $\alpha > 0$ in the formula (11.25). (See
below for details about line-search methods.)

Algorithm 11.3 (Broyden).

Choose $x_0$ and a nonsingular initial Jacobian approximation $B_0$;

for $k = 0, 1, 2, \ldots$

    Calculate a solution $p_k$ to the linear equations

    \[
    B_k p_k = -r(x_k); \tag{11.29}
    \]

    Choose $\alpha_k$ by performing a line search along $p_k$;

    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;

    $s_k \leftarrow x_{k+1} - x_k$;

    $y_k \leftarrow r(x_{k+1}) - r(x_k)$;

    Obtain $B_{k+1}$ from the formula (11.28);

end (for)

Under certain assumptions, Broyden's method converges superlinearly, that is,

\[
\| x_{k+1} - x^* \| = o(\| x_k - x^* \|). \tag{11.30}
\]

This local convergence rate is fast enough for most practical purposes, though not as fast as
the Q-quadratic convergence of Newton's method.

We illustrate the difference between the convergence rates of Newton's and Broyden's
method with a small example. The function $r : \mathbb{R}^2 \to \mathbb{R}^2$ defined by

\[
r(x) = \begin{bmatrix} (x_1 + 3)(x_2^3 - 7) + 18 \\ \sin(x_2 e^{x_1} - 1) \end{bmatrix} \tag{11.31}
\]

has a nondegenerate root at $x^* = (0, 1)^T$. We start both methods from the point
$x_0 = (-0.5, 1.4)^T$, and use the exact Jacobian $J(x_0)$ at this point as the initial Jacobian
approximation $B_0$. Results are shown in Table 11.1.

Newton's method clearly exhibits Q-quadratic convergence, which is characterized by
doubling of the exponent of the error at each iteration. Broyden's method takes twice as
many iterations as Newton's, and reduces the error at a rate that accelerates slightly towards
the end.

Table 11.1 Convergence of Iterates in Broyden and Newton Methods

                    ||x_k - x*||_2
Iteration k     Broyden           Newton
    0           0.64 × 10^0       0.64 × 10^0
    1           0.62 × 10^-1      0.62 × 10^-1
    2           0.52 × 10^-3      0.21 × 10^-3
    3           0.25 × 10^-3      0.18 × 10^-7
    4           0.43 × 10^-4      0.12 × 10^-15
    5           0.14 × 10^-6
    6           0.57 × 10^-9
    7           0.18 × 10^-11
    8           0.87 × 10^-15

The function norms $\| r(x_k) \|$ approach zero at a similar rate to the iteration errors
$\| x_k - x^* \|$. As in (11.10), we have that

\[
r(x_k) = r(x_k) - r(x^*) \approx J(x^*)(x_k - x^*),
\]

so by nonsingularity of $J(x^*)$, the norms of $r(x_k)$ and $(x_k - x^*)$ are bounded above and
below by multiples of each other. For our example problem (11.31), convergence of the
sequence of function norms in the two methods is shown in Table 11.2.

Table 11.2 Convergence of Function Norms in Broyden and Newton Methods

                    ||r(x_k)||_2
Iteration k     Broyden           Newton
    0           0.74 × 10^1       0.74 × 10^1
    1           0.59 × 10^0       0.59 × 10^0
    2           0.20 × 10^-2      0.23 × 10^-2
    3           0.21 × 10^-2      0.16 × 10^-6
    4           0.37 × 10^-3      0.22 × 10^-15
    5           0.12 × 10^-5
    6           0.49 × 10^-8
    7           0.15 × 10^-10
    8           0.11 × 10^-18
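
The experiment can be repeated with a short script. The sketch below runs both methods on (11.31) from $x_0 = (-0.5, 1.4)^T$ with $B_0 = J(x_0)$; it assumes unit steps ($\alpha_k = 1$) in Broyden's method, writes out the Jacobian by hand, and stops Broyden's iteration once $\|r(x_k)\|$ falls below $10^{-13}$. These choices are assumptions of the sketch, and the digits it produces need not match the tables exactly.

```python
import math


def r(x):
    return [(x[0] + 3.0) * (x[1] ** 3 - 7.0) + 18.0,
            math.sin(x[1] * math.exp(x[0]) - 1.0)]


def jac(x):
    e = math.exp(x[0])
    c = math.cos(x[1] * e - 1.0)
    return [[x[1] ** 3 - 7.0, 3.0 * (x[0] + 3.0) * x[1] ** 2],
            [x[1] * e * c, e * c]]


def solve_2x2(a, b):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def err(x):
    """Distance to the root x* = (0, 1)."""
    return math.hypot(x[0], x[1] - 1.0)


def newton_errors(x, n_iter):
    errs = [err(x)]
    for _ in range(n_iter):
        p = solve_2x2(jac(x), [-v for v in r(x)])
        x = [x[0] + p[0], x[1] + p[1]]
        errs.append(err(x))
    return errs


def broyden_errors(x, n_iter, tol=1e-13):
    B, rx = jac(x), r(x)                  # B_0 is the exact Jacobian at x_0
    errs = [err(x)]
    for _ in range(n_iter):
        if math.hypot(rx[0], rx[1]) < tol:
            break
        s = solve_2x2(B, [-v for v in rx])            # unit step, so s_k = p_k
        x = [x[0] + s[0], x[1] + s[1]]
        r_new = r(x)
        y = [r_new[0] - rx[0], r_new[1] - rx[1]]
        Bs = [B[i][0] * s[0] + B[i][1] * s[1] for i in range(2)]
        sts = s[0] ** 2 + s[1] ** 2
        B = [[B[i][j] + (y[i] - Bs[i]) * s[j] / sts for j in range(2)]
             for i in range(2)]                        # update (11.28)
        rx = r_new
        errs.append(err(x))
    return errs


x0 = [-0.5, 1.4]
newton_hist = newton_errors(x0, 5)
broyden_hist = broyden_errors(x0, 15)
```

Printing the two histories side by side shows the pattern described above: the first step coincides because $B_0 = J(x_0)$, after which Newton's error exponent roughly doubles per iteration while Broyden's decreases more gradually.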

The convergence analysis of Broyden's method is more complicated than that of
Newton's method. We state the following result without proof.

Theorem 11.5.

Suppose the assumptions of Theorem 11.2 hold. Then there are positive constants $\epsilon$ and
$\delta$ such that if the starting point $x_0$ and the starting approximate Jacobian $B_0$ satisfy

\[
\| x_0 - x^* \| \le \delta, \qquad \| B_0 - J(x^*) \| \le \epsilon, \tag{11.32}
\]

the sequence $\{x_k\}$ generated by Broyden's method (11.24), (11.28) is well-defined and converges
Q-superlinearly to $x^*$.

The second condition in (11.32), that the initial Jacobian approximation $B_0$ must
be close to the true Jacobian at the solution $J(x^*)$, is difficult to guarantee in practice.
In contrast to the case of unconstrained minimization, a good choice of $B_0$ can be
critical to the performance of the algorithm. Some implementations of Broyden's method
recommend choosing $B_0$ to be $J(x_0)$, or some finite-difference approximation to this
matrix.
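
The following is a minimal sketch of Broyden's method with unit steps, in which $B_0$ is a forward-difference approximation to $J(x_0)$ and each new matrix is the rank-one update $B_{k+1} = B_k + (y_k - B_k s_k) s_k^T / (s_k^T s_k)$. The 2x2 test system, the difference interval `h`, the tolerance, and the helper names are illustrative assumptions and are not taken from the text.

```python
import math

def r(x):
    # Hypothetical 2x2 test system (an assumption, not the book's example (11.31)).
    # One root lies near (1.93185, 0.51764).
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]

def fd_jacobian(x, h=1e-7):
    # Forward-difference approximation to J(x); used only to build B_0.
    r0 = r(x)
    B = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        xh = list(x)
        xh[j] += h
        rh = r(xh)
        for i in range(2):
            B[i][j] = (rh[i] - r0[i]) / h
    return B

def solve2(A, b):
    # Solve a 2x2 linear system by Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def broyden(x, tol=1e-10, max_iter=50):
    B = fd_jacobian(x)                       # B_0 approximates J(x_0)
    rk = r(x)
    for k in range(max_iter):
        if math.hypot(rk[0], rk[1]) <= tol:
            return x, k
        s = solve2(B, [-rk[0], -rk[1]])      # B_k s_k = -r(x_k)
        x = [x[0] + s[0], x[1] + s[1]]       # unit step length
        r_new = r(x)
        y = [r_new[0] - rk[0], r_new[1] - rk[1]]
        Bs = [B[0][0] * s[0] + B[0][1] * s[1],
              B[1][0] * s[0] + B[1][1] * s[1]]
        ss = s[0] * s[0] + s[1] * s[1]
        for i in range(2):                   # rank-one (secant) update of B
            for j in range(2):
                B[i][j] += (y[i] - Bs[i]) * s[j] / ss
        rk = r_new
    return x, max_iter

x_sol, iters = broyden([2.0, 0.5])
```

Because the starting point is close to the root and $B_0$ is close to the true Jacobian, this run stays within the local regime that Theorem 11.5 addresses; a poorer $B_0$ may behave quite differently.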

The Broyden matrix $B_k$ will be dense in general, even if the true Jacobian $J$ is sparse.
Therefore, when $n$ is large, an implementation of Broyden's method that stores $B_k$ as a full
$n \times n$ matrix may be inefficient. Instead, we can use limited-memory methods in which
$B_k$ is stored implicitly in the form of a number of vectors of length $n$, while the system
(11.29) is solved by a technique based on application of the Sherman-Morrison-Woodbury
formula (A.28). These methods are similar to the ones described in Chapter 7 for large-scale
unconstrained optimization.


TENSOR  METHODS 

In tensor methods, the linear model $M_k(p)$ used by Newton's method (11.5) is
augmented with an extra term that aims to capture some of the nonlinear, higher-order
behavior of $r$. By doing so, these methods achieve more rapid and reliable convergence to
degenerate roots, in particular, to roots $x^*$ for which the Jacobian $J(x^*)$ has rank $n-1$ or $n-2$.
We give a broad outline of the method here, and refer to Schnabel and Frank [277] for
details.

We use $\hat{M}_k(p)$ to denote the model function on which tensor methods are based; this
function has the form

\[ \hat{M}_k(p) = r(x_k) + J(x_k) p + \tfrac{1}{2} T_k p p, \tag{11.33} \]

where $T_k$ is a tensor defined by $n^3$ elements $(T_k)_{ijl}$ whose action on a pair of arbitrary vectors
$u$ and $v$ in $\mathbb{R}^n$ is defined by

\[ (T_k u v)_i = \sum_{j=1}^{n} \sum_{l=1}^{n} (T_k)_{ijl}\, u_j v_l. \]

If we followed the reasoning behind Newton's method, we could consider building $T_k$ from
the second derivatives of $r$ at the point $x_k$, that is,

\[ (T_k)_{ijl} = [\nabla^2 r_i(x_k)]_{jl}. \]
For instance, in the example (11.31), we have that

\[ (T(x) u v)_1 = u^T \nabla^2 r_1(x)\, v
   = u^T \begin{bmatrix} 0 & 3x_2^2 \\ 3x_2^2 & 6x_2(x_1 + 3) \end{bmatrix} v
   = 3x_2^2 (u_1 v_2 + u_2 v_1) + 6x_2(x_1 + 3)\, u_2 v_2. \]


However, use of the exact second derivatives is not practical in most instances. If we were to
store this information explicitly, about $n^3/2$ memory locations would be needed, about $n$
times the requirements of Newton's method. Moreover, there may be no vector $p$ for which
$\hat{M}_k(p) = 0$, so the step may not even be defined.

Instead, the approach described in [277] defines $T_k$ in a way that requires little
additional storage, but which gives $\hat{M}_k$ some potentially appealing properties. Specifically,
$T_k$ is chosen so that $\hat{M}_k(p)$ interpolates the function $r(x_k + p)$ at some previous iterates
visited by the algorithm. That is, we require that

\[ \hat{M}_k(x_{k-j} - x_k) = r(x_{k-j}), \quad \text{for } j = 1, 2, \dots, q, \tag{11.34} \]

for some integer $q > 0$. By substituting from (11.33), we see that $T_k$ must satisfy the
condition

\[ \tfrac{1}{2} T_k s_{jk} s_{jk} = r(x_{k-j}) - r(x_k) - J(x_k) s_{jk}, \]

where

\[ s_{jk} = x_{k-j} - x_k, \quad j = 1, 2, \dots, q. \]

In [277] it is shown that this condition can be ensured by choosing $T_k$ so that its action on
arbitrary vectors $u$ and $v$ is

\[ T_k u v = \sum_{j=1}^{q} a_j (s_{jk}^T u)(s_{jk}^T v), \]

where $a_j$, $j = 1, 2, \dots, q$, are vectors of length $n$. The number of interpolating points $q$
is typically chosen to be quite modest, usually less than $\sqrt{n}$. This $T_k$ can be stored in $2nq$
locations, which contain the vectors $a_j$ and $s_{jk}$ for $j = 1, 2, \dots, q$. Note the connection
between this idea and Broyden's method, which also chooses information in the model
(albeit in the first-order part of the model) to interpolate the function value at the previous
iterate.
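
As a small worked case, consider a single interpolation point, $q = 1$, and write $s = s_{1k} = x_{k-1} - x_k$. The derivation below only specializes the formulas already stated; it is meant as an illustration and not as the construction used in [277] for general $q$.

\[
\begin{aligned}
T_k u v &= a_1 (s^T u)(s^T v)
  \quad\Longrightarrow\quad T_k s s = a_1 (s^T s)^2, \\
\tfrac{1}{2} T_k s s &= r(x_{k-1}) - r(x_k) - J(x_k) s
  \quad\Longrightarrow\quad
  a_1 = \frac{2\,[\,r(x_{k-1}) - r(x_k) - J(x_k) s\,]}{(s^T s)^2}.
\end{aligned}
\]

With this choice, the model value at $p = s$ is $r(x_k) + J(x_k) s + \tfrac{1}{2} T_k s s = r(x_{k-1})$, so the interpolation requirement holds at the previous iterate, and only the two vectors $a_1$ and $s$ need to be stored.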

This technique can be refined in various ways. The points of interpolation can be
chosen to make the collection of directions $s_{jk}$ more linearly independent. There may still
not be a vector $p$ for which $\hat{M}_k(p) = 0$, but we can instead take the step to be the vector that
minimizes $\|\hat{M}_k(p)\|_2^2$, which can be found by using a specialized least-squares technique.
There is no assurance that the step obtained in this way is a descent direction for the merit
function $\tfrac{1}{2}\|r(x)\|^2$ (which is discussed in the next section), and in this case it can be replaced
by the standard Newton direction $-J_k^{-1} r_k$.


11.2  PRACTICAL METHODS


We  now  consider  practical  variants  of  the  Newton-like  methods  discussed  above,  in  which 
line-search  and  trust-region  modifications  to  the  steps  are  made  in  order  to  ensure  better 
global  convergence  behavior. 

MERIT  FUNCTIONS 

As mentioned above, neither Newton's method (11.6) nor Broyden's method (11.24),
(11.28) with unit step lengths can be guaranteed to converge to a solution of $r(x) = 0$ unless
they are started close to that solution. Sometimes, components of the unknown or function
vector or the Jacobian will blow up. Another, more exotic, kind of behavior is cycling, where
the iterates move between distinct regions of the parameter space without approaching a
root. An example is the scalar function

\[ r(x) = -x^5 + x^3 + 4x, \]

which has five nondegenerate roots. When started from the point $x_0 = 1$, Newton's method
produces a sequence of iterates that oscillates between $1$ and $-1$ (see Exercise 11.3) without
converging to any of the roots.
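
The cycling behavior can be checked directly. The short sketch below applies the pure Newton iteration (unit step length) to the scalar function above; the function names are illustrative.

```python
def r(x):
    return -x**5 + x**3 + 4.0 * x

def dr(x):
    # Derivative of r.
    return -5.0 * x**4 + 3.0 * x**2 + 4.0

def newton_iterates(x0, n):
    # Pure Newton iteration x <- x - r(x)/r'(x), with no safeguards.
    xs = [x0]
    for _ in range(n):
        x = xs[-1]
        xs.append(x - r(x) / dr(x))
    return xs

iterates = newton_iterates(1.0, 6)
print(iterates)
```

At $x = 1$ we have $r = 4$ and $r' = 2$, so the step is $-2$; at $x = -1$ we have $r = -4$ and $r' = 2$, so the step is $+2$, and the iteration never leaves the two-point cycle.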

The  Newton  and  Broyden  methods  can  be  made  more  robust  by  using  line-search  and 
trust-region  techniques  similar  to  those  described  in  Chapters  3  and  4.  Before  describing 
these techniques, we need to define a merit function, which is a scalar-valued function of $x$
that indicates whether a new iterate is better or worse than the current iterate, in the sense of
making progress toward a root of $r$. In unconstrained optimization, the objective function
$f$ is itself a natural merit function; most algorithms for minimizing $f$ require a decrease
in $f$ at each iteration. In nonlinear equations, the merit function is obtained by combining
the $n$ components of the vector $r$ in some way.

The most widely used merit function is the sum of squares, defined by

\[ f(x) = \tfrac{1}{2}\|r(x)\|^2 = \tfrac{1}{2}\sum_{i=1}^{n} r_i^2(x). \tag{11.35} \]

(The factor 1/2 is introduced for convenience.) Any root $x^*$ of $r$ obviously has $f(x^*) = 0$,
and since $f(x) \ge 0$ for all $x$, each root is a minimizer of $f$. However, local minimizers of
$f$ are not roots of $r$ if $f$ is strictly positive at the point in question. Still, the merit function
(11.35) has been used successfully in many applications and is implemented in a number of
software packages.

Figure 11.2  Plot of $\tfrac{1}{2}[\sin(5x) - x]^2$, showing its many local minima.

The merit function for the example (11.3) is plotted in Figure 11.2. It shows three
local minima corresponding to the three roots, but there are many other local minima (for
example, those at around $\pm 1.53053$). Local minima like these that are not roots of $r$ satisfy
an interesting property. Since

\[ \nabla f(x^*) = J(x^*)^T r(x^*) = 0, \tag{11.36} \]

we can have $r(x^*) \ne 0$ only if $J(x^*)$ is singular.
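
The spurious minimizer near $1.53053$ can be located in closed form, which gives a quick check of the singularity property. The sketch below assumes, as the caption of Figure 11.2 indicates, that the example is $r(x) = \sin(5x) - x$, so that $J(x) = 5\cos(5x) - 1$; the variable names are illustrative.

```python
import math

def r(x):
    return math.sin(5.0 * x) - x

def J(x):
    # Scalar Jacobian (derivative) of r.
    return 5.0 * math.cos(5.0 * x) - 1.0

def f(x):
    # Sum-of-squares merit function (11.35) for n = 1.
    return 0.5 * r(x) ** 2

# J(x) = 0 means cos(5x) = 1/5; this branch of solutions lies near 1.53053.
x_bar = (2.0 * math.pi + math.acos(0.2)) / 5.0

grad = J(x_bar) * r(x_bar)   # gradient of f, as in (11.36)
h = 1e-3
is_local_min = f(x_bar - h) > f(x_bar) < f(x_bar + h)

print(x_bar, r(x_bar), J(x_bar), grad, is_local_min)
```

The residual at this point is far from zero while the Jacobian vanishes to rounding error, which is the situation that (11.36) describes.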

Since local minima for the sum-of-squares merit function may be points of attraction
for the algorithms described in this section, global convergence results for the algorithms
discussed here are less satisfactory than for similar algorithms applied to unconstrained
optimization.

Other merit functions are also used in practice. One such is the $\ell_1$ norm merit function
defined by

\[ f_1(x) = \|r(x)\|_1 = \sum_{i=1}^{n} |r_i(x)|. \]

This  function  is  studied  in  Chapters  17  and  18  in  the  context  of  algorithms  for  constrained 
optimization. 
LINE SEARCH METHODS

We can obtain algorithms with global convergence properties by applying the
line-search approach of Chapter 3 to the sum-of-squares merit function $f(x) = \tfrac{1}{2}\|r(x)\|^2$.
When it is well defined, the Newton step

\[ J(x_k) p_k = -r(x_k) \tag{11.37} \]

is a descent direction for $f(\cdot)$ whenever $r_k \ne 0$, since

\[ p_k^T \nabla f(x_k) = p_k^T J_k^T r_k = -\|r_k\|^2 < 0. \tag{11.38} \]

Step lengths $\alpha_k$ are chosen by one of the procedures of Chapter 3, and the iterates are defined
by the formula

\[ x_{k+1} = x_k + \alpha_k p_k, \quad k = 0, 1, 2, \dots. \tag{11.39} \]

For the case of line searches that choose $\alpha_k$ to satisfy the Wolfe conditions (3.6), we have the
following convergence result, which follows directly from Theorem 3.2.

Theorem 11.6.

Suppose that $J(x)$ is Lipschitz continuous in a neighborhood $\mathcal{D}$ of the level set
$\mathcal{L} = \{x : f(x) \le f(x_0)\}$, and that $\|J(x)\|$ and $\|r(x)\|$ are bounded above on $\mathcal{D}$. Suppose that a
line-search algorithm (11.39) is applied to $f$, where the search directions $p_k$ satisfy $p_k^T \nabla f_k < 0$
while the step lengths $\alpha_k$ satisfy the Wolfe conditions (3.6). Then we have that the Zoutendijk
condition holds, that is,

\[ \sum_{k \ge 0} \cos^2\theta_k \, \|J_k^T r_k\|^2 < \infty, \]

where

\[ \cos\theta_k = \frac{-p_k^T \nabla f(x_k)}{\|p_k\| \, \|\nabla f(x_k)\|}. \tag{11.40} \]

We omit the proof, which verifies that $\nabla f$ is Lipschitz continuous on $\mathcal{D}$ and that $f$ is
bounded below (by 0) on $\mathcal{D}$, and then applies Theorem 3.2.

Provided that the sequence of iterates satisfies

\[ \cos\theta_k \ge \delta, \quad \text{for some } \delta \in (0, 1) \text{ and all } k \text{ sufficiently large}, \tag{11.41} \]

Theorem 11.6 guarantees that $J_k^T r_k \to 0$, meaning that the iterates approach stationarity of
the merit function $f$. Moreover, if we know that $\|J(x_k)^{-1}\|$ is bounded then we must have
$r_k \to 0$.
We now investigate the values of $\cos\theta_k$ for the directions generated by the Newton
and inexact Newton methods. From (11.40) and (11.38), we have for the exact Newton step
(11.6) that

\[ \cos\theta_k = \frac{-p_k^T \nabla f(x_k)}{\|p_k\| \, \|\nabla f(x_k)\|}
   = \frac{\|r_k\|^2}{\|J_k^{-1} r_k\| \, \|J_k^T r_k\|}
   \ge \frac{1}{\|J_k^T\| \, \|J_k^{-1}\|} = \frac{1}{\kappa(J_k)}. \tag{11.42} \]

When $p_k$ is an inexact Newton direction, that is, one that satisfies the condition
(11.17) with $\eta_k \le \eta < 1$, we have that

\[
\begin{aligned}
\|r_k + J_k p_k\|^2 \le \eta_k^2 \|r_k\|^2
 &\;\Rightarrow\; 2 p_k^T J_k^T r_k + \|r_k\|^2 + \|J_k p_k\|^2 \le \eta^2 \|r_k\|^2 \\
 &\;\Rightarrow\; p_k^T \nabla f_k = p_k^T J_k^T r_k \le [(\eta^2 - 1)/2]\, \|r_k\|^2.
\end{aligned}
\]

Meanwhile,

\[ \|p_k\| \le \|J_k^{-1}\| \left[ \|r_k + J_k p_k\| + \|r_k\| \right] \le \|J_k^{-1}\| (\eta + 1) \|r_k\|, \]

and

\[ \|\nabla f_k\| = \|J_k^T r_k\| \le \|J_k\| \, \|r_k\|. \]

By combining these estimates, we obtain

\[ \cos\theta_k = \frac{-p_k^T \nabla f_k}{\|p_k\| \, \|\nabla f_k\|}
   \ge \frac{1 - \eta^2}{2 \|J_k\| \, \|J_k^{-1}\| (1 + \eta)}
   \ge \frac{1 - \eta}{2 \kappa(J_k)}. \]

We conclude that a bound of the form (11.41) is satisfied both for the exact and inexact
Newton methods, provided that the condition number $\kappa(J_k)$ is bounded.

When $\kappa(J_k)$ is large, however, this lower bound is close to zero, and use of the Newton
direction may cause poor performance of the algorithm. In fact, the following example
shows that $\cos\theta_k$ can converge to zero, causing the algorithm to fail. This example
highlights a fundamental weakness of the line-search approach.


□  Example 11.2 (Powell [241])

Consider the problem of finding a solution of the nonlinear system

\[ r(x) = \begin{bmatrix} x_1 \\[4pt] \dfrac{10 x_1}{x_1 + 0.1} + 2 x_2^2 \end{bmatrix} = 0, \tag{11.43} \]

with unique solution $x^* = 0$. We try to solve this problem using the Newton iteration (11.37),
(11.39) where $\alpha_k$ is chosen to minimize $f$ along $p_k$. It is proved in [241] that, starting from
the point $(3, 1)^T$, the iterates converge to $(1.8016, 0)^T$ (to four digits of accuracy). However,
this point is not a solution of (11.43). In fact, it is not even a stationary point of $f$, and a
step from this point in the direction $-\nabla f$ will produce a decrease in both components of
$r$. To verify these claims, note that the Jacobian of $r$, which is

\[ J(x) = \begin{bmatrix} 1 & 0 \\[4pt] \dfrac{1}{(x_1 + 0.1)^2} & 4 x_2 \end{bmatrix}, \]

is singular at all $x$ for which $x_2 = 0$. For such points, we have

\[ \nabla f(x) = \begin{bmatrix} x_1 + \dfrac{10 x_1}{(x_1 + 0.1)^3} \\[4pt] 0 \end{bmatrix}, \]

so that the gradient points in the direction of the positive $x_1$ axis whenever $x_1 > 0$. The
point $(1.8016, 0)^T$ is therefore not a stationary point of $f$.

For this example, a calculation shows that the Newton step generated from an iterate
that is close to (but not quite on) the $x_1$ axis tends to be parallel to the $x_2$ axis, making it
nearly orthogonal to the gradient $\nabla f(x)$. That is, $\cos\theta_k$ for the Newton direction may be
arbitrarily close to zero.
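
The near-orthogonality claimed in Example 11.2 is easy to observe numerically. The sketch below evaluates $\cos\theta_k$ from (11.40) for the exact Newton direction at points $(1.8016, x_2)$ with $x_2$ shrinking toward zero; the sample values of $x_2$ and the function names are illustrative choices.

```python
import math

def r(x):
    # Powell's system (11.43).
    return [x[0], 10.0 * x[0] / (x[0] + 0.1) + 2.0 * x[1] ** 2]

def jac(x):
    return [[1.0, 0.0],
            [1.0 / (x[0] + 0.1) ** 2, 4.0 * x[1]]]

def cos_theta_newton(x):
    rk, J = r(x), jac(x)
    # J is lower triangular, so J p = -r is solved by forward substitution.
    p1 = -rk[0] / J[0][0]
    p2 = (-rk[1] - J[1][0] * p1) / J[1][1]
    # Gradient of the merit function: grad f = J^T r.
    g = [J[0][0] * rk[0] + J[1][0] * rk[1],
         J[0][1] * rk[0] + J[1][1] * rk[1]]
    return -(p1 * g[0] + p2 * g[1]) / (math.hypot(p1, p2) * math.hypot(g[0], g[1]))

cosines = [cos_theta_newton([1.8016, x2]) for x2 in (1e-1, 1e-3, 1e-6)]
print(cosines)
```

The second component of the Newton step grows like $1/x_2$, so the computed cosine shrinks roughly in proportion to $x_2$ while remaining positive.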


In  this  example,  a  Newton  method  with  exact  line  searches  is  attracted  to  a  point  of 
no  interest  at  which  the  Jacobian  is  singular.  Since  systems  of  nonlinear  equations  often 
contain  singular  points,  this  behavior  gives  cause  for  concern. 

To prevent this undesirable behavior and ensure that (11.41) holds, we may have to
modify the Newton direction. One possibility is to add some multiple $\lambda_k I$ of the identity to
$J_k^T J_k$, and define the step $p_k$ to be

\[ p_k = -(J_k^T J_k + \lambda_k I)^{-1} J_k^T r_k. \tag{11.44} \]

For any $\lambda_k > 0$ the matrix in parentheses is nonsingular, and if $\lambda_k$ is bounded away from zero,
a condition of the form (11.41) is satisfied. Therefore, some practical algorithms choose $\lambda_k$
adaptively to ensure that the matrix in (11.44) does not approach singularity. This approach
is analogous to the classical Levenberg-Marquardt algorithm discussed in Chapter 10. To
implement it without forming $J_k^T J_k$ explicitly and performing trial Cholesky factorizations
of the matrices $(J_k^T J_k + \lambda I)$, we can use the technique (10.36) illustrated earlier for the
least-squares case. This technique uses the fact that the Cholesky factor of $(J_k^T J_k + \lambda I)$ is
identical to $R^T$, where $R$ is the upper triangular factor from the QR factorization of the
matrix

\[ \begin{bmatrix} J_k \\ \sqrt{\lambda}\, I \end{bmatrix}. \tag{11.45} \]

A combination of Householder and Givens transformations can be used, as for (10.36), and
the savings noted in the discussion following (10.36) continue to hold if we need to perform
this calculation for several candidate values of $\lambda_k$.

The drawback of this Levenberg-Marquardt approach is that it is difficult to choose
$\lambda_k$. If $\lambda_k$ is too large, we can destroy the fast rate of convergence of Newton's method. (Note
that $p_k$ approaches a multiple of $-J_k^T r_k$ as $\lambda_k \uparrow \infty$, so the step becomes small and tends
to point in the steepest-descent direction for $f$.) If $\lambda_k$ is too small, the algorithm can be
inefficient in the presence of Jacobian singularities. A more satisfactory approach is to follow
the trust-region approach described below, which chooses $\lambda_k$ indirectly.
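
The effect of the regularization parameter on the search direction can be seen on Powell's system from Example 11.2. The sketch below forms the step (11.44) explicitly for a 2x2 problem (a real implementation would use the QR-based technique mentioned above) and reports $\cos\theta_k$ for several values of $\lambda_k$; the evaluation point and the values of $\lambda_k$ are illustrative assumptions.

```python
import math

def r(x):
    # Powell's system (11.43).
    return [x[0], 10.0 * x[0] / (x[0] + 0.1) + 2.0 * x[1] ** 2]

def jac(x):
    return [[1.0, 0.0],
            [1.0 / (x[0] + 0.1) ** 2, 4.0 * x[1]]]

def regularized_cosine(x, lam):
    rk, J = r(x), jac(x)
    # g = J^T r, the gradient of the merit function.
    g = [J[0][0] * rk[0] + J[1][0] * rk[1],
         J[0][1] * rk[0] + J[1][1] * rk[1]]
    # Entries of the symmetric matrix A = J^T J + lam * I.
    a11 = J[0][0] ** 2 + J[1][0] ** 2 + lam
    a12 = J[0][0] * J[0][1] + J[1][0] * J[1][1]
    a22 = J[0][1] ** 2 + J[1][1] ** 2 + lam
    det = a11 * a22 - a12 * a12
    # Solve A p = -g by Cramer's rule; this is the step (11.44).
    p = [(-g[0] * a22 + a12 * g[1]) / det,
         (-a11 * g[1] + a12 * g[0]) / det]
    return -(p[0] * g[0] + p[1] * g[1]) / (math.hypot(p[0], p[1]) * math.hypot(g[0], g[1]))

x = [1.8016, 1e-3]          # close to, but not on, the x_1 axis
for lam in (0.0, 1e-2, 1.0, 1e2):
    print(lam, regularized_cosine(x, lam))
```

With $\lambda_k = 0$ the step is the Newton step and is nearly orthogonal to the gradient; as $\lambda_k$ grows the direction rotates toward steepest descent, which is the trade-off between robustness and fast local convergence described above.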

We  conclude  by  specifying  an  algorithm  based  on  Newton-like  steps  and  line  searches 
that  regularizes  the  step  calculations  where  necessary.  Several  details  are  deliberately  left 
vague;  we  refer  the  reader  to  the  papers  cited  above  for  details. 


Algorithm 11.4 (Line Search Newton-like Method).

Given $c_1$, $c_2$ with $0 < c_1 < c_2 <$
Choose $x_0$;
for $k = 0, 1, 2, \dots$
    Calculate a Newton-like step from (11.6) (regularizing with (11.44)
        if $J_k$ appears to be near-singular), or (11.17) or (11.24);
    if $\alpha = 1$ satisfies the Wolfe conditions (3.6)
        Set $\alpha_k = 1$;
    else
        Perform a line search to find $\alpha_k > 0$ that satisfies (3.6);
    end (if)
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$;
end (for)
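
A simplified instance of this framework is sketched below for the scalar cycling example $r(x) = -x^5 + x^3 + 4x$. Note the substitution: instead of a line search satisfying the Wolfe conditions (3.6), the sketch uses plain backtracking on the sufficient-decrease condition only, and no regularization is included. The constant `c1`, the halving factor, and the tolerance are illustrative assumptions.

```python
def r(x):
    return -x**5 + x**3 + 4.0 * x

def dr(x):
    return -5.0 * x**4 + 3.0 * x**2 + 4.0

def f(x):
    # Sum-of-squares merit function for n = 1.
    return 0.5 * r(x) ** 2

def newton_backtracking(x, c1=1e-4, tol=1e-12, max_iter=50):
    for k in range(max_iter):
        rk = r(x)
        if abs(rk) <= tol:
            return x, k
        p = -rk / dr(x)            # Newton step (11.37)
        slope = -rk * rk           # p * f'(x) = -r_k^2, cf. (11.38)
        alpha = 1.0                # always try the unit step first
        while f(x + alpha * p) > f(x) + c1 * alpha * slope:
            alpha *= 0.5           # backtrack until sufficient decrease
        x = x + alpha * p
    return x, max_iter

root, iters = newton_backtracking(1.0)
print(root, iters)
```

From $x_0 = 1$ the unit step leads to $-1$, where the merit function is unchanged, so the step is rejected; the halved step happens to land on the root $x = 0$. The point of the example is only that requiring a decrease in $f$ breaks the cycle.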


TRUST-REGION  METHODS 

The most widely used trust-region methods for nonlinear equations simply
apply Algorithm 4.1 from Chapter 4 to the merit function $f(x) = \tfrac{1}{2}\|r(x)\|_2^2$, using
$B_k = J(x_k)^T J(x_k)$ as the approximate Hessian in the model function $m_k(p)$, which is
defined as follows:

\[ m_k(p) = \tfrac{1}{2}\|r_k + J_k p\|_2^2 = f_k + p^T J_k^T r_k + \tfrac{1}{2} p^T J_k^T J_k p. \]

The step $p_k$ is generated by finding an approximate solution of the subproblem

\[ \min_p \; m_k(p), \quad \text{subject to } \|p\| \le \Delta_k, \tag{11.46} \]

where $\Delta_k$ is the radius of the trust region. The ratio $\rho_k$ of actual to predicted reduction (see
(4.4)), which plays a critical role in many trust-region algorithms, is therefore

\[ \rho_k = \frac{\|r(x_k)\|^2 - \|r(x_k + p_k)\|^2}{\|r(x_k)\|^2 - \|r(x_k) + J(x_k) p_k\|^2}. \tag{11.47} \]

We can state the trust-region framework that results from this model as follows.

Algorithm 11.5 (Trust-Region Method for Nonlinear Equations).

Given $\bar\Delta > 0$, $\Delta_0 \in (0, \bar\Delta)$, and $\eta \in [0, \tfrac14)$:
for $k = 0, 1, 2, \dots$
    Calculate $p_k$ as an (approximate) solution of (11.46);
    Evaluate $\rho_k$ from (11.47);
    if $\rho_k < \tfrac14$
        $\Delta_{k+1} = \tfrac14 \|p_k\|$;
    else
        if $\rho_k > \tfrac34$ and $\|p_k\| = \Delta_k$
            $\Delta_{k+1} = \min(2\Delta_k, \bar\Delta)$;
        else
            $\Delta_{k+1} = \Delta_k$;
        end (if)
    end (if)
    if $\rho_k > \eta$
        $x_{k+1} = x_k + p_k$;
    else
        $x_{k+1} = x_k$;
    end (if)
end (for).

The dogleg method is a special case of the trust-region algorithm, Algorithm 4.1,
that constructs an approximate solution to (11.46) based on the Cauchy point $p_k^C$ and the
unconstrained minimizer of $m_k$. The Cauchy point is

\[ p_k^C = -\tau_k \left( \Delta_k / \|J_k^T r_k\| \right) J_k^T r_k, \tag{11.48} \]

where

\[ \tau_k = \min\left\{ 1, \; \|J_k^T r_k\|^3 / \left( \Delta_k \, r_k^T J_k (J_k^T J_k) J_k^T r_k \right) \right\}. \tag{11.49} \]

By comparing with the general definition (4.11), (4.12) we see that it is not necessary to
consider the case of an indefinite Hessian approximation in $m_k(p)$, since the model Hessian
$J_k^T J_k$ that we use is positive semidefinite. The unconstrained minimizer of $m_k(p)$ is unique
when $J_k$ is nonsingular. In this case, we denote it by $p_k^J$ and write

\[ p_k^J = -(J_k^T J_k)^{-1} (J_k^T r_k) = -J_k^{-1} r_k. \]

The selection of $p_k$ in the dogleg method proceeds as follows.

Procedure 11.6 (Dogleg).

Calculate $p_k^C$;
if $\|p_k^C\| = \Delta_k$
    $p_k \leftarrow p_k^C$;
else
    Calculate $p_k^J$;
    $p_k \leftarrow p_k^C + \tau (p_k^J - p_k^C)$, where $\tau$ is the largest value in $[0, 1]$
        such that $\|p_k\| \le \Delta_k$;
end (if).
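
The sketch below combines Procedure 11.6 with the trust-region loop of Algorithm 11.5 for a 2x2 problem. The test system, the starting radius, the value of $\eta$, the stopping tolerance, and the relative tolerance used to test $\|p_k\| = \Delta_k$ in floating point are all illustrative assumptions; the sketch also assumes that $J_k$ stays nonsingular along the way.

```python
import math

def r(x):
    # Hypothetical 2x2 test system (not from the text); root near (1.93185, 0.51764).
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]

def jac(x):
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]

def norm(v):
    return math.hypot(v[0], v[1])

def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]

def solve2(A, b):
    # 2x2 solve by Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def dogleg(J, rk, delta):
    g = [J[0][0] * rk[0] + J[1][0] * rk[1],
         J[0][1] * rk[0] + J[1][1] * rk[1]]          # g = J^T r
    gn = norm(g)
    Jg = matvec(J, g)
    tau = min(1.0, gn ** 3 / (delta * (Jg[0] ** 2 + Jg[1] ** 2)))   # (11.49)
    pc = [-tau * delta / gn * g[0], -tau * delta / gn * g[1]]       # (11.48)
    if tau >= 1.0:                       # Cauchy point lies on the boundary
        return pc
    pj = solve2(J, [-rk[0], -rk[1]])     # unconstrained minimizer -J^{-1} r
    if norm(pj) <= delta:
        return pj
    # Largest t in [0, 1] with ||pc + t (pj - pc)|| = delta.
    d = [pj[0] - pc[0], pj[1] - pc[1]]
    a = d[0] ** 2 + d[1] ** 2
    b = 2.0 * (pc[0] * d[0] + pc[1] * d[1])
    c = pc[0] ** 2 + pc[1] ** 2 - delta ** 2
    t = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [pc[0] + t * d[0], pc[1] + t * d[1]]

def trust_region(x, delta=0.01, delta_max=10.0, eta=0.1, tol=1e-10, max_iter=100):
    for k in range(max_iter):
        rk = r(x)
        if norm(rk) <= tol:
            break
        J = jac(x)
        p = dogleg(J, rk, delta)
        x_new = [x[0] + p[0], x[1] + p[1]]
        lin = matvec(J, p)
        pred = norm(rk) ** 2 - norm([rk[0] + lin[0], rk[1] + lin[1]]) ** 2
        rho = (norm(rk) ** 2 - norm(r(x_new)) ** 2) / pred           # (11.47)
        if rho < 0.25:
            delta = 0.25 * norm(p)
        elif rho > 0.75 and norm(p) >= (1.0 - 1e-12) * delta:
            delta = min(2.0 * delta, delta_max)
        if rho > eta:
            x = x_new
    return x, k

x_sol, iters = trust_region([2.0, 0.5])
print(x_sol, iters)
```

The deliberately small initial radius forces the first few steps onto the trust-region boundary, so the radius update is exercised before the iteration settles into pure Newton steps.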

Lemma 4.2 shows that when $J_k$ is nonsingular, the vector $p_k$ chosen above is the
minimizer of $m_k$ along the piecewise linear path that leads from the origin to the Cauchy
point and then to the unconstrained minimizer $p_k^J$. Hence, the reduction in model function
at least matches the reduction obtained by the Cauchy point, which can be estimated by
specializing the bound (4.20) to the least-squares case by writing

\[ m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \; \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right), \tag{11.50} \]

where $c_1$ is some positive constant.

From Theorem 4.1, we know that the exact solution of (11.46) has the form

\[ p_k = -(J_k^T J_k + \lambda_k I)^{-1} J_k^T r_k, \tag{11.51} \]

for some $\lambda_k \ge 0$, and that $\lambda_k = 0$ if the unconstrained solution $p_k^J$ satisfies $\|p_k^J\| \le \Delta_k$. (Note
that (11.51) is identical to the formula (10.34a) from Chapter 10. In fact, the
Levenberg-Marquardt approach for nonlinear equations is a special case of the same algorithm for
nonlinear least-squares problems.) The Levenberg-Marquardt algorithm uses the techniques
of Section 4.3 to search for the value of $\lambda_k$ that satisfies (11.51). The procedure described
in the "exact" trust-region algorithm, Algorithm 4.3, is based on Cholesky factorizations,
but as in Chapter 10, we can replace these by specialized algorithms to compute the QR
factorization of the matrix (11.45). Even if the exact $\lambda_k$ corresponding to the solution of
(11.46) is not found, the $p_k$ calculated from (11.51) will still yield global convergence if it
satisfies the condition (11.50) for some value of $c_1$, together with

\[ \|p_k\| \le \gamma \Delta_k, \quad \text{for some constant } \gamma \ge 1. \tag{11.52} \]

The  dogleg  method  requires  just  one  linear  system  to  be  solved  per  iteration,  whereas 
methods  that  search  for  the  exact  solution  of  (11.46)  require  several  such  systems  to  be 
solved.  As  in  Chapter  4,  there  is  a  tradeoff  to  be  made  between  the  amount  of  effort  to  spend 
on  each  iteration  and  the  total  number  of  function  and  derivative  evaluations  required. 

We can also consider alternative trust-region approaches that are based on different
merit functions and different definitions of the trust region. An algorithm based on the $\ell_1$
merit function with an $\ell_\infty$-norm trust region gives rise to subproblems of the form

\[ \min_p \; \|J_k p + r_k\|_1 \quad \text{subject to } \|p\|_\infty \le \Delta, \tag{11.53} \]

which can be formulated and solved using linear programming techniques. This approach
is closely related to the S$\ell_1$QP and SLQP approaches for nonlinear programming discussed
in Section 18.5.

Global convergence results of Algorithm 11.5 when the steps $p_k$ satisfy (11.50) and
(11.52) are given in the following theorem, which can be proved by referring directly to
Theorems 4.5 and 4.6. The first result is for $\eta = 0$, in which the algorithm accepts all steps
that produce a decrease in the merit function $f_k$, while the second (stronger) result requires
a strictly positive choice of $\eta$.

Theorem 11.7.

Suppose that $J(x)$ is Lipschitz continuous and that $\|J(x)\|$ is bounded above in a
neighborhood $\mathcal{D}$ of the level set $\mathcal{L} = \{x : f(x) \le f(x_0)\}$. Suppose in addition that all
approximate solutions of (11.46) satisfy the bounds (11.50) and (11.52). Then if $\eta = 0$ in
Algorithm 11.5, we have that

\[ \liminf_{k \to \infty} \|J_k^T r_k\| = 0, \]

while if $\eta \in (0, \tfrac14)$, we have

\[ \lim_{k \to \infty} \|J_k^T r_k\| = 0. \]

We turn now to local convergence of the trust-region algorithm for the case in which
the subproblem (11.46) is solved exactly. We assume that the sequence $\{x_k\}$ converges to
a nondegenerate solution $x^*$ of the nonlinear equations $r(x) = 0$. The significance of
this  result  is  that  the  algorithmic  enhancements  needed  for  global  convergence  do  not,  in 
well-designed  algorithms,  interfere  with  the  fast  local  convergence  properties  described  in 
Section  11.1. 
Theorem 11.8.

Suppose that the sequence $\{x_k\}$ generated by Algorithm 11.5 converges to a nondegenerate
solution $x^*$ of the problem $r(x) = 0$. Suppose also that $J(x)$ is Lipschitz continuous in an open
neighborhood $\mathcal{D}$ of $x^*$ and that the trust-region subproblem (11.46) is solved exactly for all
sufficiently large $k$. Then the sequence $\{x_k\}$ converges quadratically to $x^*$.

PROOF. We prove this result by showing that there is an index $K$ such that the trust-region
radius is not reduced further after iteration $K$; that is, $\Delta_k \ge \Delta_K$ for all $k \ge K$. We then
show that the algorithm eventually takes the pure Newton step at every iteration, so that
quadratic convergence follows from Theorem 11.2.

Let $p_k$ denote the exact solution of (11.46). Note first that $p_k$ will simply be the
unconstrained Newton step $-J_k^{-1} r_k$ whenever this step satisfies the trust-region bound. Otherwise,
we have $\|J_k^{-1} r_k\| > \Delta_k$, while the solution $p_k$ satisfies $\|p_k\| \le \Delta_k$. In either case, we have

\[ \|p_k\| \le \|J_k^{-1} r_k\|. \tag{11.54} \]

We consider the ratio $\rho_k$ of actual to predicted reduction defined by (11.47). We have
directly from the definition that

\[ |1 - \rho_k| = \frac{\left| \|r_k + J_k p_k\|^2 - \|r(x_k + p_k)\|^2 \right|}{\|r(x_k)\|^2 - \|r(x_k) + J(x_k) p_k\|^2}. \tag{11.55} \]

From Theorem 11.1, we have for the second term in the numerator that

\[ \|r(x_k + p_k)\|^2 = \left\| [r(x_k) + J(x_k) p_k] + w(x_k, x_k + p_k) \right\|^2, \tag{11.56} \]

where $w(\cdot, \cdot)$ is defined as in (11.11). Because of Lipschitz continuity of $J$ with Lipschitz
constant $\beta_L$ (11.7), we have

\[ \|w(x_k, x_k + p_k)\| \le \int_0^1 \|J(x_k + t p_k) - J(x_k)\| \, \|p_k\| \, dt
   \le \int_0^1 \beta_L \, t \, \|p_k\|^2 \, dt = (\beta_L/2) \|p_k\|^2, \]

so that using (11.56) and the fact that $\|r_k + J_k p_k\| \le \|r_k\| = (2 f(x_k))^{1/2}$ (since $p_k$ is the
solution of (11.46)), we can bound the numerator as follows:

\[
\begin{aligned}
\left| \|r_k + J_k p_k\|^2 - \|r(x_k + p_k)\|^2 \right|
 &\le 2 \|r_k + J_k p_k\| \, \|w(x_k, x_k + p_k)\| + \|w(x_k, x_k + p_k)\|^2 \\
 &\le (2 f(x_k))^{1/2} \beta_L \|p_k\|^2 + (\beta_L/2)^2 \|p_k\|^4 \\
 &\le \epsilon(x_k) \|p_k\|^2,
\end{aligned}
\tag{11.57}
\]

where we define

\[ \epsilon(x_k) = (2 f(x_k))^{1/2} \beta_L + (\beta_L/2)^2 \|p_k\|^2. \]

Since $x_k \to x^*$ by assumption, it follows that $f(x_k) \to 0$ and $\|r_k\| \to 0$. Because $x^*$ is a
nondegenerate root, we have as in (11.14) that $\|J(x_k)^{-1}\| \le \beta^*$ for all $k$ sufficiently large,
so from (11.54), we have

\[ \|p_k\| \le \|J_k^{-1} r_k\| \le \beta^* \|r_k\| \to 0. \tag{11.58} \]

Hence, $\epsilon(x_k) \to 0$.

Turning now to the denominator of (11.55), we define $\bar{p}_k$ to be a step of the same
length as the solution $p_k$ in the Newton direction $-J_k^{-1} r_k$, that is,

\[ \bar{p}_k = -\frac{\|p_k\|}{\|J_k^{-1} r_k\|} J_k^{-1} r_k. \]

Since $\bar{p}_k$ is feasible for (11.46), and since $p_k$ is optimal for this subproblem, we have

\[
\begin{aligned}
\|r_k\|^2 - \|r_k + J_k p_k\|^2
 &\ge \|r_k\|^2 - \left\| r_k - \frac{\|p_k\|}{\|J_k^{-1} r_k\|} r_k \right\|^2 \\
 &= 2 \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \|r_k\|^2 - \frac{\|p_k\|^2}{\|J_k^{-1} r_k\|^2} \|r_k\|^2 \\
 &\ge \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \|r_k\|^2,
\end{aligned}
\]

where for the last inequality we have used (11.54). By using (11.58) again, we have from this
bound that

\[ \|r_k\|^2 - \|r_k + J_k p_k\|^2 \ge \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \|r_k\|^2 \ge \frac{1}{\beta^*} \|p_k\| \, \|r_k\|. \tag{11.59} \]

By substituting (11.57) and (11.59) into (11.55), and then applying (11.58) again, we have

\[ |1 - \rho_k| \le \frac{\beta^* \epsilon(x_k) \|p_k\|^2}{\|p_k\| \, \|r_k\|} \le (\beta^*)^2 \epsilon(x_k) \to 0. \tag{11.60} \]

Therefore, for all $k$ sufficiently large, we have $\rho_k > \tfrac14$, and so the trust-region radius $\Delta_k$ will
not be reduced beyond this point. As claimed, there is an index $K$ such that

\[ \Delta_k \ge \Delta_K, \quad \text{for all } k \ge K. \]

Since $\|J_k^{-1} r_k\| \le \beta^* \|r_k\| \to 0$, the Newton step $-J_k^{-1} r_k$ will eventually be smaller
than $\Delta_K$ (and hence $\Delta_k$), so it will eventually always be accepted as the solution of (11.46).
The result now follows from Theorem 11.2.  □

We can replace the assumption that $x_k \to x^*$ with an assumption that the
nondegenerate solution $x^*$ is just one of the limit points of the sequence. (In fact, this condition
implies that $x_k \to x^*$; see Exercise 11.9.)


11.3  CONTINUATION/HOMOTOPY METHODS

MOTIVATION 

We mentioned above that Newton-based methods all suffer from one shortcoming:
Unless $J(x)$ is nonsingular in the region of interest (a condition that often cannot be
guaranteed), they are in danger of converging to a local minimum of the merit function
that is not a solution of the nonlinear system. Continuation methods, which we
outline in this section, are more likely to converge to a solution of $r(x) = 0$ in difficult cases.
Their underlying motivation is simple to describe: Rather than dealing with the original
problem $r(x) = 0$ directly, we set up an "easy" system of equations for which the solution
is obvious. We then gradually transform the easy system into the original system $r(x)$, and
follow the solution as it moves from the solution of the easy problem to the solution of the
original problem.

One simple way to define the so-called homotopy map $H(x, \lambda)$ is as follows:

\[ H(x, \lambda) = \lambda r(x) + (1 - \lambda)(x - a), \tag{11.61} \]

where $\lambda$ is a scalar parameter and $a \in \mathbb{R}^n$ is a fixed vector. When $\lambda = 0$, (11.61) defines the
artificial, easy problem $H(x, 0) = x - a$, whose solution is obviously $x = a$. When $\lambda = 1$,
we have $H(x, 1) = r(x)$, the original system of equations.

To solve $r(x) = 0$, consider the following algorithm: First, set $\lambda = 0$ in (11.61) and set
$x = a$. Then, increase $\lambda$ from 0 to 1 in small increments, and for each value of $\lambda$, calculate
the solution of the system $H(x, \lambda) = 0$. The final value of $x$ corresponding to $\lambda = 1$ will
solve the original problem $r(x) = 0$.

This naive approach sounds plausible, and Figure 11.3 illustrates a situation in which
it would be successful. In this figure, there is a unique solution $x$ of the system $H(x, \lambda) = 0$
for each value of $\lambda$ in the range $[0, 1]$. The trajectory of points $(x, \lambda)$ for which $H(x, \lambda) = 0$
is called the zero path.
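
A minimal sketch of this naive scheme for a scalar problem follows. The test function $r(x) = x^3 + x - 10$ (root $x = 2$), the choice $a = 0$, the number of increments, and the fixed count of Newton corrections per increment are illustrative assumptions. For this particular $r$, the derivative of $H$ with respect to $x$ is positive for every $\lambda \in [0, 1]$, so the zero path has no turning points and the scheme is expected to succeed.

```python
def r(x):
    # Hypothetical scalar test function with root x = 2 (not from the text).
    return x**3 + x - 10.0

def dr(x):
    return 3.0 * x**2 + 1.0

def H(x, lam, a):
    # Homotopy map (11.61).
    return lam * r(x) + (1.0 - lam) * (x - a)

def dH_dx(x, lam):
    return lam * dr(x) + (1.0 - lam)

def naive_continuation(a=0.0, steps=20, newton_iters=20):
    x = a                                   # solution of H(x, 0) = 0
    path = [(x, 0.0)]
    for i in range(1, steps + 1):
        lam = i / steps                     # increase lambda monotonically
        for _ in range(newton_iters):       # Newton on H(., lam), warm-started
            x -= H(x, lam, a) / dH_dx(x, lam)
        path.append((x, lam))
    return path

path = naive_continuation()
x_final = path[-1][0]
print(x_final)
```

Each inner solve starts from the solution at the previous value of $\lambda$, which is what keeps the Newton corrections local.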

Unfortunately, however, the approach often fails, as illustrated in Figure 11.4. Here,
the algorithm follows the lower branch of the curve from $\lambda = 0$ to $\lambda = \lambda_T$, but it then loses
the trail unless it is lucky enough to jump to the top branch of the path. The value $\lambda_T$ is
known as a turning point, since at this point we can follow the path smoothly only if we no
longer insist on increasing $\lambda$ at every step. In fact, practical continuation methods work by
doing exactly as Figure 11.4 suggests, that is, they follow the zero path explicitly, even if this
means allowing $\lambda$ to decrease from time to time.

Figure 11.3  Plot of a zero path: Trajectory of points $(x, \lambda)$ with $H(x, \lambda) = 0$.

Figure 11.4  Zero path with turning points. The path joining $(a, 0)$ to $(x^*, 1)$ cannot
be followed by increasing $\lambda$ monotonically from 0 to 1.

PRACTICAL  CONTINUATION  METHODS 

In one practical technique, we model the zero path by allowing both $x$ and $\lambda$ to be functions of an independent variable $s$ that represents arc length along the path. That is, $(x(s), \lambda(s))$ is the point that we arrive at by traveling a distance $s$ along the path from the initial point $(x(0), \lambda(0)) = (a, 0)$. Because we have that

$$H(x(s), \lambda(s)) = 0, \quad \text{for all } s \ge 0,$$

we  can  take  the  total  derivative  of  this  expression  with  respect  to  s  to  obtain 

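$$\frac{\partial}{\partial x} H(x, \lambda)\,\dot{x} + \frac{\partial}{\partial \lambda} H(x, \lambda)\,\dot{\lambda} = 0, \quad \text{where } (\dot{x}, \dot{\lambda}) = \left( \frac{dx}{ds}, \frac{d\lambda}{ds} \right). \tag{11.62}$$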

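The vector $(\dot{x}(s), \dot{\lambda}(s))$ is the tangent vector to the zero path, as we illustrate in Figure 11.4. From (11.62), we see that it lies in the null space of the $n \times (n + 1)$ matrix

$$\begin{bmatrix} \dfrac{\partial}{\partial x} H(x, \lambda) & \dfrac{\partial}{\partial \lambda} H(x, \lambda) \end{bmatrix}. \tag{11.63}$$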

When this matrix has full rank, its null space has dimension 1, so to complete the definition of $(\dot{x}, \dot{\lambda})$ in this case, we need to assign it a length and direction. The length is fixed by imposing the normalization condition

$$\|\dot{x}(s)\|_2^2 + |\dot{\lambda}(s)|^2 = 1, \quad \text{for all } s, \tag{11.64}$$

which ensures that $s$ is the true arc length along the path from $(a, 0)$ to $(x(s), \lambda(s))$. We need to choose the sign to ensure that we keep moving forward along the zero path. A heuristic that works well is to choose the sign so that the tangent vector $(\dot{x}, \dot{\lambda})$ at the current value of $s$ makes an angle of less than $\pi/2$ with the tangent vector at the previous value of $s$.

We  can  outline  the  complete  procedure  for  computing  (x,  X)  as  follows: 

Procedure 11.7 (Tangent Vector Calculation).

Compute a vector in the null space of (11.63) by performing a QR factorization with column pivoting,

$$Q^T \begin{bmatrix} \dfrac{\partial}{\partial x} H(x, \lambda) & \dfrac{\partial}{\partial \lambda} H(x, \lambda) \end{bmatrix} \Pi = \begin{bmatrix} R & w \end{bmatrix},$$

where $Q$ is $n \times n$ orthogonal, $R$ is $n \times n$ upper triangular, $\Pi$ is an $(n + 1) \times (n + 1)$ permutation matrix, and $w \in \mathbb{R}^n$.

Set

$$v = \Pi \begin{bmatrix} R^{-1} w \\ -1 \end{bmatrix};$$

Set $(\dot{x}, \dot{\lambda}) = \pm v / \|v\|_2$, where the sign is chosen to satisfy the angle criterion mentioned above.

Details  of  the  QR  factorization  procedure  are  given  in  the  Appendix. 
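A short check may help to see why a pivoted QR factorization delivers a tangent direction. The sketch below uses the notation of Procedure 11.7 and assumes that $R$ is nonsingular, which holds when the matrix (11.63) has full rank; the particular vector $v$ written here is one convenient null vector, and any nonzero multiple of it would serve equally well before normalization.

```latex
\begin{aligned}
A &:= \begin{bmatrix} \dfrac{\partial H}{\partial x} & \dfrac{\partial H}{\partial \lambda} \end{bmatrix},
\qquad Q^T A \,\Pi = \begin{bmatrix} R & w \end{bmatrix}
\;\Longrightarrow\; A\,\Pi = Q \begin{bmatrix} R & w \end{bmatrix}, \\[4pt]
v &:= \Pi \begin{bmatrix} R^{-1} w \\ -1 \end{bmatrix}
\;\Longrightarrow\;
A v = Q \begin{bmatrix} R & w \end{bmatrix} \begin{bmatrix} R^{-1} w \\ -1 \end{bmatrix}
    = Q\,(w - w) = 0 .
\end{aligned}
```

Since the last component of the unpermuted vector is $-1$, $v$ is never zero, so the normalization $v / \|v\|_2$ is always defined.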

Since we can obtain the tangent at any given point $(x, \lambda)$ and since we know the initial point $(x(0), \lambda(0)) = (a, 0)$, we can trace the zero path by calling a standard initial-value first-order ordinary differential equation solver, terminating the algorithm when it finds a value of $s$ for which $\lambda(s) = 1$.

A second approach for following the zero path is quite similar to the one just described, except that it takes an algebraic viewpoint instead of a differential-equations viewpoint. Given a current point $(x, \lambda)$, we compute the tangent vector $(\dot{x}, \dot{\lambda})$ as above, and take a small step (of length $\epsilon$, say) along this direction to produce a “predictor” point $(x^P, \lambda^P)$; that is,

$$(x^P, \lambda^P) = (x, \lambda) + \epsilon (\dot{x}, \dot{\lambda}).$$


Usually, this new point will not lie exactly on the zero path, so we apply some “corrector” iterations to bring it back to the path, thereby identifying a new iterate $(x^+, \lambda^+)$ that satisfies $H(x^+, \lambda^+) = 0$. (This process is illustrated in Figure 11.5.) During the corrections, we choose a component of the predictor step $(x^P, \lambda^P)$, one of the components that has been changing most rapidly during the past few steps, and hold this component fixed during the correction process. If the index of this component is $i$, and if we use a pure Newton corrector process (often adequate, since $(x^P, \lambda^P)$ is usually quite close to the target point $(x^+, \lambda^+)$), the steps will have the form

$$\begin{bmatrix} \begin{matrix} \dfrac{\partial H}{\partial x} & \dfrac{\partial H}{\partial \lambda} \end{matrix} \\ e_i^T \end{bmatrix} \begin{bmatrix} \delta x \\ \delta \lambda \end{bmatrix} = \begin{bmatrix} -H \\ 0 \end{bmatrix},$$

where the quantities $\partial H / \partial x$, $\partial H / \partial \lambda$, and $H$ are evaluated at the latest point of the corrector process. The last row of this system serves to fix the $i$th component of $(\delta x, \delta \lambda)$ at zero; the vector $e_i \in \mathbb{R}^{n+1}$ is a vector with $n + 1$ components containing all zeros, except for a 1 in the location $i$ that corresponds to the fixed component. Note that in Figure 11.5 the $\lambda$ component is chosen to be fixed on the current iteration. On the following iteration, it may be more appropriate to choose $x$ as the fixed component, as we reach the turning point in $\lambda$.

Figure 11.5 The algebraic predictor-corrector procedure, using $\lambda$ as the fixed variable in the correction process.
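For a scalar unknown, the predictor-corrector idea fits in a short program. The sketch below is an illustration only, under several assumptions: the homotopy is $H(x, \lambda) = \lambda (x^2 - 1) + (1 - \lambda)(x - a)$ with $a = 1/2$ (the setting of Exercise 11.10); the null vector of the $1 \times 2$ matrix (11.63) is obtained by a rotation rather than by QR; the fixed component is chosen by comparing the two tangent components, a simplified stand-in for "changing most rapidly during the past few steps"; and the step length `eps` and the clamping of $\lambda$ at 1 are choices made here.

```python
import math

A = 0.5  # starting point a (assumed here; taken from Exercise 11.10)


def H(x, lam):
    return lam * (x * x - 1.0) + (1.0 - lam) * (x - A)


def grad_H(x, lam):
    """Return (dH/dx, dH/dlam)."""
    return 2.0 * lam * x + (1.0 - lam), (x * x - 1.0) - (x - A)


def tangent(x, lam, prev):
    hx, hl = grad_H(x, lam)
    tx, tl = -hl, hx  # orthogonal to (hx, hl), so it spans the null space
    norm = math.hypot(tx, tl)
    tx, tl = tx / norm, tl / norm
    if tx * prev[0] + tl * prev[1] < 0.0:  # angle criterion: keep moving forward
        tx, tl = -tx, -tl
    return tx, tl


def follow_zero_path(eps=0.05, max_steps=1000, tol=1e-12):
    x, lam = A, 0.0
    t = (0.0, 1.0)  # initial orientation: increasing lam
    for _ in range(max_steps):
        t = tangent(x, lam, t)
        xp, lp = x + eps * t[0], lam + eps * t[1]  # predictor step
        done = lp >= 1.0
        if done:
            lp = 1.0  # land exactly on lam = 1 and correct in x only
        fix_lam = done or abs(t[1]) >= abs(t[0])
        for _newton in range(20):  # Newton corrector with one component fixed
            h = H(xp, lp)
            if abs(h) < tol:
                break
            hx, hl = grad_H(xp, lp)
            if fix_lam:
                xp -= h / hx
            else:
                lp -= h / hl
        x, lam = xp, lp
        if done:
            return x, lam
    raise RuntimeError("did not reach lam = 1")


x_end, lam_end = follow_zero_path()
print(x_end, lam_end)
```

With one component held fixed, the bordered Newton system reduces to a scalar Newton step in the other component, which is all the inner loop does.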

The two variants on path-following described above are able to follow curves like those depicted in Figure 11.4 to a solution of the nonlinear system. They rely, however, on the $n \times (n + 1)$ matrix in (11.63) having full rank for all $(x, \lambda)$ along the path, so that the tangent vector is well-defined. The following result shows that full rank is guaranteed under certain assumptions.

Theorem 11.9 (Watson [305]).

Suppose that $r$ is twice continuously differentiable. Then for almost all vectors $a \in \mathbb{R}^n$, there is a zero path emanating from $(a, 0)$ along which the $n \times (n + 1)$ matrix (11.63) has full rank. If this path is bounded for $\lambda \in [0, 1)$, then it has an accumulation point $(\bar{x}, 1)$ such that $r(\bar{x}) = 0$. Furthermore, if the Jacobian $J(\bar{x})$ is nonsingular, the zero path between $(a, 0)$ and $(\bar{x}, 1)$ has finite arc length.

The theorem assures us that unless we are unfortunate in the choice of $a$, the algorithms described above can be applied to obtain a path that either diverges or else leads to a point $\bar{x}$ that is a solution of the original nonlinear system if $J(\bar{x})$ is nonsingular. More detailed convergence results can be found in Watson [305] and the references therein.

We  conclude  with  an  example  to  show  that  divergence  of  the  zero  path — the  less 
desirable  outcome  of  Theorem  11.9 — can  happen  even  for  innocent-looking  problems. 


□ Example 11.3

Consider the system $r(x) = x^2 - 1$, for which there are two nondegenerate solutions $+1$ and $-1$. Suppose we choose $a = -2$ and attempt to apply a continuation method to the function

$$H(x, \lambda) = \lambda (x^2 - 1) + (1 - \lambda)(x + 2) = \lambda x^2 + (1 - \lambda) x + (2 - 3\lambda), \tag{11.65}$$


obtained by substituting into (11.61). The zero paths for this function are plotted in Figure 11.6. As can be seen from that diagram, there is no zero path that joins $(-2, 0)$ to either $(1, 1)$ or $(-1, 1)$, so the continuation methods fail on this example. We can find the values of $\lambda$ for which no solution exists by using the formula for a quadratic root to obtain

$$x = \frac{-(1 - \lambda) \pm \sqrt{(1 - \lambda)^2 - 4\lambda(2 - 3\lambda)}}{2\lambda}.$$

Now, when the term in the square root is negative, the corresponding values of $x$ are complex, that is, there are no real roots $x$. Since $(1 - \lambda)^2 - 4\lambda(2 - 3\lambda) = 13\lambda^2 - 10\lambda + 1$, it is easy to verify that such is the case when

$$\lambda \in \left( \frac{5 - 2\sqrt{3}}{13}, \frac{5 + 2\sqrt{3}}{13} \right) \approx (0.118, 0.651).$$

Note that the zero path starting from $(-2, 0)$ becomes unbounded, which is one of the possible outcomes of Theorem 11.9.

Figure 11.6 Zero paths for the example in which $H(x, \lambda) = \lambda(x^2 - 1) + (1 - \lambda)(x + 2)$. There is no continuous zero path from $\lambda = 0$ to $\lambda = 1$.
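The interval of $\lambda$ values with no real root can be confirmed numerically. The snippet below is a small check rather than part of any algorithm: it evaluates the discriminant of the quadratic in (11.65) and the two endpoints $(5 \pm 2\sqrt{3})/13$; the sample values of $\lambda$ used in the sign checks are arbitrary choices made here.

```python
import math


def discriminant(lam):
    # (1 - lam)^2 - 4 lam (2 - 3 lam), which expands to 13 lam^2 - 10 lam + 1
    return (1.0 - lam) ** 2 - 4.0 * lam * (2.0 - 3.0 * lam)


lo = (5.0 - 2.0 * math.sqrt(3.0)) / 13.0
hi = (5.0 + 2.0 * math.sqrt(3.0)) / 13.0
print(round(lo, 3), round(hi, 3))  # prints 0.118 0.651

# Negative inside the interval (no real x), positive outside it.
print(discriminant(0.3) < 0.0, discriminant(0.05) > 0.0, discriminant(0.9) > 0.0)
```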


This  example  indicates  that  continuation  methods  may  fail  to  produce  a  solution  even 
to  a  fairly  simple  system  of  nonlinear  equations.  However,  it  is  generally  true  that  they  are 
more  reliable  than  the  merit-function  methods  described  earlier  in  the  chapter.  The  extra 
robustness  comes  at  a  price,  since  continuation  methods  typically  require  significantly  more 
computational effort than the merit-function methods.

NOTES AND REFERENCES

Nonlinear  differential  equations  and  integral  equations  are  a  rich  source  of  nonlinear 
equations.  When  formulated  as  finite-dimensional  nonlinear  equations,  the  unknown  vector 
x  is  a  discrete  approximation  to  the  (infinite-dimensional)  solution.  In  other  applications, 
the  vector  x  is  intrinsically  finite-dimensional;  it  may  represent  the  quantities  of  materials 
to be transported between pairs of cities in a distribution network, for instance. In all cases, the equations $r_i$ enforce consistency, conservation, and optimality principles in the model. Moré [212] and Averick et al. [10] discuss a number of interesting practical applications.

For analysis of the convergence of Broyden’s method, including proofs of Theorem 11.5, see Dennis and Schnabel [92, Chapter 8] and Kelley [177, Chapter 6]. Details on a limited-memory implementation of Broyden’s method are given by Kelley [177, Section 7.3].

Example  11.2  and  the  algorithm  described  by  Powell  [241]  have  been  influential 
beyond  the  field  of  nonlinear  equations.  The  example  shows  that  a  line-search  method 
may  not  be  able  to  achieve  sufficient  decrease,  whereas  the  Cauchy  step  in  the  trust-region 
approach  is  designed  to  guarantee  that  this  condition  holds  and  hence  that  reasonable 
convergence  properties  are  guaranteed.  The  dogleg  algorithm  proposed  in  [241]  can  be 
viewed  as  one  of  the  first  modern  trust-region  methods. 


Exercises

11.1 Show that for any vector $s \in \mathbb{R}^n$, we have

$$\left\| \frac{s s^T}{s^T s} \right\| = 1,$$

where $\| \cdot \|$ denotes the Euclidean matrix norm.

11.2 Consider the function $r : \mathbb{R} \to \mathbb{R}$ defined by $r(x) = x^q$, where $q$ is an integer greater than 2. Note that $x^* = 0$ is the sole root of this function and that it is degenerate. Show that Newton’s method converges Q-linearly, and find the value of the convergence ratio $r$ in (A.34).

11.3 Show that Newton’s method applied to the function $r(x) = -x^5 + x^3 + 4x$ starting from $x_0 = 1$ produces the cyclic behavior described in the text. Find the roots of this function, and check that they are nondegenerate.

11.4 For the scalar function $r(x) = \sin(5x) - x$, show that the sum-of-squares merit function has infinitely many local minima, and find a general formula for such points.

11.5 When $r : \mathbb{R}^n \to \mathbb{R}^n$, show that the function

$$\phi(\lambda) = \left\| (J^T J + \lambda I)^{-1} J^T r \right\|$$

is monotonically decreasing in $\lambda$ unless $J^T r = 0$. (Hint: Use the singular-value decomposition of $J$.)

11.6 Prove part (iii) of Theorem 11.3.

11.7 Consider a line-search Newton method in which the step length $\alpha_k$ is chosen to be the exact minimizer of the merit function $f(\cdot)$; that is,

$$\alpha_k = \arg\min_{\alpha} f(x_k - \alpha J_k^{-1} r_k).$$

Show that if $J(x)$ is nonsingular at the solution $x^*$, then $\alpha_k \to 1$ as $x_k \to x^*$.

11.8 Let $J \in \mathbb{R}^{n \times m}$ and $r \in \mathbb{R}^n$ and suppose that $J J^T r = 0$. Show that $J^T r = 0$. (Hint: This doesn’t even take one line!)

11.9 Suppose we replace the assumption of $x_k \to x^*$ in Theorem 11.8 by an assumption that the nondegenerate solution $x^*$ is a limit point of $\{x_k\}$. By adding some logic to the proof of this result, show that in fact $x^*$ is the only possible limit point of the sequence. (Hint: Show that $\|J_{k+1}^T r_{k+1}\| \le \tfrac{1}{2} \|J_k^T r_k\|$ for all $k$ sufficiently large, and hence that for any constant $\epsilon > 0$, the sequence $\{x_k\}$ satisfies $\|x_k - x^*\| \le \epsilon$ for all $k$ sufficiently large.)

11.10 Consider the following modification of our example of failure of continuation methods:

$$r(x) = x^2 - 1, \qquad a = \tfrac{1}{2}.$$

Show that for this example there is a zero path for $H(x, \lambda) = \lambda(x^2 - 1) + (1 - \lambda)(x - a)$ that connects $(\tfrac{1}{2}, 0)$ to $(1, 1)$, so that continuation methods should work for this choice of
starting point.

The second part of this book is about minimizing functions subject to constraints on the variables. A general formulation for these problems is

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad \begin{cases} c_i(x) = 0, & i \in \mathcal{E}, \\ c_i(x) \ge 0, & i \in \mathcal{I}, \end{cases} \tag{12.1}$$

where $f$ and the functions $c_i$ are all smooth, real-valued functions on a subset of $\mathbb{R}^n$, and $\mathcal{I}$ and $\mathcal{E}$ are two finite sets of indices. As before, we call $f$ the objective function, while $c_i$, $i \in \mathcal{E}$ are the equality constraints and $c_i$, $i \in \mathcal{I}$ are the inequality constraints. We define the feasible set $\Omega$ to be the set of points $x$ that satisfy the constraints; that is,

$$\Omega = \{ x \mid c_i(x) = 0, \ i \in \mathcal{E}; \ c_i(x) \ge 0, \ i \in \mathcal{I} \}, \tag{12.2}$$

so that we can rewrite (12.1) more compactly as

$$\min_{x \in \Omega} f(x). \tag{12.3}$$

In this chapter we derive mathematical characterizations of the solutions of (12.3). As in the unconstrained case, we discuss optimality conditions of two types. Necessary conditions are conditions that must be satisfied by any solution point (under certain assumptions). Sufficient conditions are those that, if satisfied at a certain point $x^*$, guarantee that $x^*$ is in fact a solution.

For  the  unconstrained  optimization  problem  of  Chapter  2,  the  optimality  conditions 
were  as  follows: 

Necessary conditions: Local unconstrained minimizers have $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ positive semidefinite.

Sufficient conditions: Any point $x^*$ at which $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite is a strong local minimizer of $f$.

In  this  chapter,  we  derive  analogous  conditions  to  characterize  the  solutions  of  constrained 
optimization  problems. 

LOCAL  AND  GLOBAL  SOLUTIONS 

We  have  seen  already  that  global  solutions  are  difficult  to  find  even  when  there  are 
no  constraints.  The  situation  may  be  improved  when  we  add  constraints,  since  the  feasible 
set  might  exclude  many  of  the  local  minima  and  it  may  be  comparatively  easy  to  pick  the 
global  minimum  from  those  that  remain.  However,  constraints  can  also  make  things  more 
difficult.  As  an  example,  consider  the  problem 

$$\min \ (x_2 + 100)^2 + 0.01 x_1^2, \quad \text{subject to} \quad x_2 - \cos x_1 \ge 0, \tag{12.4}$$

illustrated in Figure 12.1. Without the constraint, the problem has the unique solution $(0, -100)^T$. With the constraint, there are local solutions near the points

$$x^{(k)} = (k\pi, -1)^T, \quad \text{for } k = \pm 1, \pm 3, \pm 5, \ldots.$$

Definitions of the different types of local solutions are simple extensions of the corresponding definitions for the unconstrained case, except that now we restrict consideration to the feasible points in the neighborhood of $x^*$. We have the following definition.

Figure 12.1 Constrained problem with many isolated local solutions.

A vector $x^*$ is a local solution of the problem (12.3) if $x^* \in \Omega$ and there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x) \ge f(x^*)$ for $x \in \mathcal{N} \cap \Omega$.

Similarly, we can make the following definitions:

A vector $x^*$ is a strict local solution (also called a strong local solution) if $x^* \in \Omega$ and there is a neighborhood $\mathcal{N}$ of $x^*$ such that $f(x) > f(x^*)$ for all $x \in \mathcal{N} \cap \Omega$ with $x \ne x^*$.

A point $x^*$ is an isolated local solution if $x^* \in \Omega$ and there is a neighborhood $\mathcal{N}$ of $x^*$ such that $x^*$ is the only local solution in $\mathcal{N} \cap \Omega$.

Note  that  isolated  local  solutions  are  strict,  but  that  the  reverse  is  not  true  (see 
Exercise  12.2). 

SMOOTHNESS 

Smoothness of objective functions and constraints is an important issue in characterizing solutions, just as in the unconstrained case. It ensures that the objective function and the constraints all behave in a reasonably predictable way and therefore allows algorithms to make good choices for search directions.

We saw in Chapter 2 that graphs of nonsmooth functions contain “kinks” or “jumps” where the smoothness breaks down. If we plot the feasible region for any given constrained optimization problem, we usually observe many kinks and sharp edges. Does this mean that the constraint functions that describe these regions are nonsmooth? The answer is often no, because the nonsmooth boundaries can often be described by a collection of smooth constraint functions. Figure 12.2 shows a diamond-shaped feasible region in $\mathbb{R}^2$ that could be described by the single nonsmooth constraint

$$\|x\|_1 = |x_1| + |x_2| \le 1. \tag{12.5}$$

It can also be described by the following set of smooth (in fact, linear) constraints:

$$x_1 + x_2 \le 1, \quad x_1 - x_2 \le 1, \quad -x_1 + x_2 \le 1, \quad -x_1 - x_2 \le 1. \tag{12.6}$$

Each of the four constraints represents one edge of the feasible polytope. In general, the constraint functions are chosen so that each one represents a smooth piece of the boundary of $\Omega$.

Figure 12.2 A feasible region with a nonsmooth boundary can be described by smooth constraints.
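The equivalence of the nonsmooth description (12.5) and the linear description (12.6) can be spot-checked on a grid. The following is a minimal sketch; the grid spacing of 1/8 and the sampled square $[-2, 2]^2$ are arbitrary choices made here, picked so that all arithmetic is exact in floating point.

```python
def in_diamond_nonsmooth(x1, x2):
    # Single nonsmooth constraint (12.5): |x1| + |x2| <= 1
    return abs(x1) + abs(x2) <= 1.0


def in_diamond_linear(x1, x2):
    # Four linear constraints (12.6), one per edge of the diamond
    return (x1 + x2 <= 1.0 and x1 - x2 <= 1.0
            and -x1 + x2 <= 1.0 and -x1 - x2 <= 1.0)


grid = [i / 8.0 for i in range(-16, 17)]  # points in [-2, 2]
mismatches = [(u, v) for u in grid for v in grid
              if in_diamond_nonsmooth(u, v) != in_diamond_linear(u, v)]
print(len(mismatches))  # prints 0
```

The two tests agree because $|x_1| + |x_2|$ is the largest of the four linear expressions $\pm x_1 \pm x_2$.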


Nonsmooth, unconstrained optimization problems can sometimes be reformulated as smooth constrained problems. An example is the unconstrained minimization of a function

$$f(x) = \max(x^2, x), \tag{12.7}$$

which has kinks at $x = 0$ and $x = 1$, and the solution at $x^* = 0$. We obtain a smooth, constrained formulation of this problem by adding an artificial variable $t$ and writing

$$\min \ t \quad \text{s.t.} \quad t \ge x, \quad t \ge x^2. \tag{12.8}$$

Reformulation techniques such as (12.6) and (12.8) are used often in cases where $f$ is a maximum of a collection of functions or when $f$ is a 1-norm or $\infty$-norm of a vector function.
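A brief numerical illustration of why (12.7) and (12.8) have the same solution: for fixed $x$, the smallest $t$ satisfying both constraints of (12.8) is exactly $\max(x, x^2)$. The grid search below is only a sketch of that observation, not a solution method; the grid on $[-2, 2]$ is an arbitrary choice made here.

```python
def f(x):
    # Nonsmooth objective (12.7)
    return max(x * x, x)


def smallest_feasible_t(x):
    # For fixed x, t >= x and t >= x^2 hold exactly when t >= max(x, x^2),
    # so the least feasible t in (12.8) equals f(x).
    return max(x, x * x)


grid = [i / 100.0 for i in range(-200, 201)]
x_best = min(grid, key=f)                               # minimizer of (12.7) on the grid
t_best = min(smallest_feasible_t(x) for x in grid)      # optimal t of (12.8) on the grid
print(x_best, f(x_best), t_best)
```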

In the examples above we expressed inequality constraints in a slightly different way from the form $c_i(x) \ge 0$ that appears in the definition (12.1). However, any collection of inequality constraints with $\ge$ and $\le$ and nonzero right-hand-sides can be expressed in the form $c_i(x) \ge 0$ by simple rearrangement of the inequality.


12.1  EXAMPLES 


To  introduce  the  basic  principles  behind  the  characterization  of  solutions  of  constrained 
optimization  problems,  we  work  through  three  simple  examples.  The  discussion  here  is 
informal;  the  ideas  introduced  will  be  made  rigorous  in  the  sections  that  follow. 

We start by noting one important item of terminology that recurs throughout the rest of the book.

Definition 12.1.

The active set $\mathcal{A}(x)$ at any feasible $x$ consists of the equality constraint indices from $\mathcal{E}$ together with the indices of the inequality constraints $i$ for which $c_i(x) = 0$; that is,

$$\mathcal{A}(x) = \mathcal{E} \cup \{ i \in \mathcal{I} \mid c_i(x) = 0 \}.$$

At a feasible point $x$, the inequality constraint $i \in \mathcal{I}$ is said to be active if $c_i(x) = 0$ and inactive if the strict inequality $c_i(x) > 0$ is satisfied.
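Definition 12.1 translates directly into code. The helper below is a hypothetical sketch: the function name `active_set`, the dictionary representation of the constraints, and the tolerance `tol` are all assumptions introduced here (the definition itself uses exact equality, which is rarely testable in floating point). The sample constraints are chosen for illustration.

```python
import math


def active_set(x, equalities, inequalities, tol=1e-10):
    """Return the active set A(x) at a feasible point x.

    equalities, inequalities: dicts mapping a constraint index to a function c_i.
    An inequality counts as active when |c_i(x)| <= tol (an assumed tolerance).
    """
    active = set(equalities)  # every equality constraint is active
    active |= {i for i, c in inequalities.items() if abs(c(x)) <= tol}
    return active


# Illustrative constraints: c1(x) = 2 - x1^2 - x2^2 >= 0 and c2(x) = x2 >= 0.
ineq = {1: lambda x: 2.0 - x[0] ** 2 - x[1] ** 2,
        2: lambda x: x[1]}

print(active_set((-math.sqrt(2.0), 0.0), {}, ineq))  # both constraints active
print(active_set((1.0, 0.0), {}, ineq))              # only constraint 2 active
print(active_set((0.0, 0.5), {}, ineq))              # no active constraints
```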

A  SINGLE  EQUALITY  CONSTRAINT 


□ Example 12.1

Our first example is a two-variable problem with a single equality constraint:

$$\min \ x_1 + x_2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 2 = 0 \tag{12.9}$$

(see Figure 12.3). In the language of (12.1), we have $f(x) = x_1 + x_2$, $\mathcal{I} = \emptyset$, $\mathcal{E} = \{1\}$, and $c_1(x) = x_1^2 + x_2^2 - 2$. We can see by inspection that the feasible set for this problem is the circle of radius $\sqrt{2}$ centered at the origin (just the boundary of this circle, not its interior). The solution $x^*$ is obviously $(-1, -1)^T$. From any other point on the circle, it is easy to find a way to move that stays feasible (that is, remains on the circle) while decreasing $f$. For instance, from the point $x = (\sqrt{2}, 0)^T$ any move in the clockwise direction around the circle has the desired effect.


Figure 12.3 Problem (12.9), showing constraint and function gradients at various feasible points.




We also see from Figure 12.3 that at the solution $x^*$, the constraint normal $\nabla c_1(x^*)$ is parallel to $\nabla f(x^*)$. That is, there is a scalar $\lambda_1^*$ (in this case $\lambda_1^* = -1/2$) such that

$$\nabla f(x^*) = \lambda_1^* \nabla c_1(x^*). \tag{12.10}$$

□ 


We can derive (12.10) by examining first-order Taylor series approximations to the objective and constraint functions. To retain feasibility with respect to the constraint $c_1(x) = 0$, we require any small (but nonzero) step $s$ to satisfy $c_1(x + s) = 0$; that is,

$$0 = c_1(x + s) \approx c_1(x) + \nabla c_1(x)^T s = \nabla c_1(x)^T s. \tag{12.11}$$

Hence, the step $s$ retains feasibility with respect to $c_1$, to first order, when it satisfies

$$\nabla c_1(x)^T s = 0. \tag{12.12}$$

Similarly, if we want $s$ to produce a decrease in $f$, we would have that

$$0 > f(x + s) - f(x) \approx \nabla f(x)^T s,$$

or, to first order,

$$\nabla f(x)^T s < 0. \tag{12.13}$$

Existence of a small step $s$ that satisfies both (12.12) and (12.13) strongly suggests existence of a direction $d$ (where the size of $d$ is not small; we could have $d \approx s / \|s\|$ to ensure that the norm of $d$ is close to 1) with the same properties, namely

$$\nabla c_1(x)^T d = 0 \quad \text{and} \quad \nabla f(x)^T d < 0. \tag{12.14}$$


If, on the other hand, there is no direction $d$ with the properties (12.14), then it is likely that we cannot find a small step $s$ with the properties (12.12) and (12.13). In this case, $x^*$ would appear to be a local minimizer.

By drawing a picture, the reader can check that the only way that a $d$ satisfying (12.14) does not exist is if $\nabla f(x)$ and $\nabla c_1(x)$ are parallel, that is, if the condition $\nabla f(x) = \lambda_1 \nabla c_1(x)$ holds at $x$, for some scalar $\lambda_1$. If in fact $\nabla f(x)$ and $\nabla c_1(x)$ are not parallel, we can set

$$d = -\left( I - \frac{\nabla c_1(x) \nabla c_1(x)^T}{\|\nabla c_1(x)\|^2} \right) \nabla f(x). \tag{12.15}$$

It is easy to verify that this $d$ satisfies (12.14).
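The verification can also be carried out numerically. The sketch below assumes the direction is the negative of the projection of $\nabla f(x)$ onto the subspace orthogonal to $\nabla c_1(x)$, which is the form required for both conditions in (12.14) to hold; it is evaluated for problem (12.9) at the feasible point $x = (\sqrt{2}, 0)^T$ mentioned in Example 12.1. The function name is a hypothetical choice made here.

```python
import math


def feasible_descent_direction(grad_f, grad_c):
    """Return d = -(I - g g^T / ||g||^2) grad_f, where g = grad_c."""
    gg = sum(g * g for g in grad_c)
    coef = sum(g * f for g, f in zip(grad_c, grad_f)) / gg
    return [-(f - coef * g) for f, g in zip(grad_f, grad_c)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x = (math.sqrt(2.0), 0.0)
grad_f = [1.0, 1.0]                     # gradient of f(x) = x1 + x2
grad_c = [2.0 * x[0], 2.0 * x[1]]       # gradient of c1(x) = x1^2 + x2^2 - 2
d = feasible_descent_direction(grad_f, grad_c)

print(d)                                 # approximately (0, -1): the clockwise direction
print(dot(grad_c, d), dot(grad_f, d))    # approximately 0, and a negative number
```

The result points straight down from $(\sqrt{2}, 0)^T$, which agrees with the remark in Example 12.1 that a clockwise move around the circle decreases $f$.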
By introducing the Lagrangian function

$$\mathcal{L}(x, \lambda_1) = f(x) - \lambda_1 c_1(x), \tag{12.16}$$

and noting that $\nabla_x \mathcal{L}(x, \lambda_1) = \nabla f(x) - \lambda_1 \nabla c_1(x)$, we can state the condition (12.10) equivalently as follows: At the solution $x^*$, there is a scalar $\lambda_1^*$ such that

$$\nabla_x \mathcal{L}(x^*, \lambda_1^*) = 0. \tag{12.17}$$

This observation suggests that we can search for solutions of the equality-constrained problem (12.9) by seeking stationary points of the Lagrangian function. The scalar quantity $\lambda_1$ in (12.16) is called a Lagrange multiplier for the constraint $c_1(x) = 0$.

Though the condition (12.10) (equivalently, (12.17)) appears to be necessary for an optimal solution of the problem (12.9), it is clearly not sufficient. For instance, in Example 12.1, condition (12.10) is satisfied at the point $x = (1, 1)^T$ (with $\lambda_1 = \tfrac{1}{2}$), but this point is obviously not a solution; in fact, it maximizes the function $f$ on the circle. Moreover, in the case of equality-constrained problems, we cannot turn the condition (12.10) into a sufficient condition simply by placing some restriction on the sign of $\lambda_1$. To see this, consider replacing the constraint $x_1^2 + x_2^2 - 2 = 0$ by its negative $2 - x_1^2 - x_2^2 = 0$ in Example 12.1. The solution of the problem is not affected, but the value of $\lambda_1^*$ that satisfies the condition (12.10) changes from $\lambda_1^* = -\tfrac{1}{2}$ to $\lambda_1^* = \tfrac{1}{2}$.
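These multiplier values are easy to reproduce. The sketch below computes the scalar $\lambda_1$ that best satisfies $\nabla f(x) = \lambda_1 \nabla c_1(x)$ in the least-squares sense, together with the residual, for the circle constraint of Example 12.1 and for its negative. The helper names and the `sign` argument are hypothetical conveniences introduced here.

```python
def multiplier(grad_f, grad_c):
    """Least-squares lam for grad_f = lam * grad_c, plus the residual vector."""
    lam = sum(f * g for f, g in zip(grad_f, grad_c)) / sum(g * g for g in grad_c)
    resid = [f - lam * g for f, g in zip(grad_f, grad_c)]
    return lam, resid


def grad_c1(x, sign=1.0):
    # sign = +1: c1(x) = x1^2 + x2^2 - 2;  sign = -1: the negated constraint
    return [sign * 2.0 * x[0], sign * 2.0 * x[1]]


grad_f = [1.0, 1.0]  # gradient of f(x) = x1 + x2

print(multiplier(grad_f, grad_c1((-1.0, -1.0))))        # minimizer: lam = -0.5
print(multiplier(grad_f, grad_c1((1.0, 1.0))))          # maximizer: lam = 0.5
print(multiplier(grad_f, grad_c1((-1.0, -1.0), -1.0)))  # negated constraint: lam = 0.5
```

The residual is zero in all three cases, so (12.10) holds at the minimizer and at the maximizer alike, and the sign of the multiplier at the minimizer depends only on how the constraint is written.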

A  SINGLE  INEQUALITY  CONSTRAINT 


□ Example 12.2

This is a slight modification of Example 12.1, in which the equality constraint is replaced by an inequality. Consider

$$\min \ x_1 + x_2 \quad \text{s.t.} \quad 2 - x_1^2 - x_2^2 \ge 0, \tag{12.18}$$

for which the feasible region consists of the circle of problem (12.9) and its interior (see Figure 12.4). Note that the constraint normal $\nabla c_1$ points toward the interior of the feasible region at each point on the boundary of the circle. By inspection, we see that the solution is still $(-1, -1)^T$ and that the condition (12.10) holds for the value $\lambda_1^* = \tfrac{1}{2}$. However, this inequality-constrained problem differs from the equality-constrained problem (12.9) of Example 12.1 in that the sign of the Lagrange multiplier plays a significant role, as we now argue.

As before, we conjecture that a given feasible point $x$ is not optimal if we can find a small step $s$ that both retains feasibility and decreases the objective function $f$ to first order. The main difference between problems (12.9) and (12.18) comes in the handling of the feasibility condition. As in (12.13), the step $s$ improves the objective function, to first order, if $\nabla f(x)^T s < 0$. Meanwhile, $s$ retains feasibility if

$$0 \le c_1(x + s) \approx c_1(x) + \nabla c_1(x)^T s,$$

so, to first order, feasibility is retained if

$$c_1(x) + \nabla c_1(x)^T s \ge 0. \tag{12.19}$$

In determining whether a step $s$ exists that satisfies both (12.13) and (12.19), we consider the following two cases, which are illustrated in Figure 12.4.

Case I: Consider first the case in which $x$ lies strictly inside the circle, so that the strict inequality $c_1(x) > 0$ holds. In this case, any step vector $s$ satisfies the condition (12.19), provided only that its length is sufficiently small. In fact, whenever $\nabla f(x) \ne 0$, we can obtain a step $s$ that satisfies both (12.13) and (12.19) by setting

$$s = -\alpha \nabla f(x),$$

for any positive scalar $\alpha$ sufficiently small. However, this definition does not give a step $s$ with the required properties when

$$\nabla f(x) = 0. \tag{12.20}$$

Figure 12.4 Improvement directions $s$ from two feasible points $x$ for the problem (12.18) at which the constraint is active and inactive, respectively.

Case II: Consider now the case in which $x$ lies on the boundary of the circle, so that $c_1(x) = 0$. The conditions (12.13) and (12.19) therefore become

$$\nabla f(x)^T s < 0, \qquad \nabla c_1(x)^T s \ge 0.$$

The first of these conditions defines an open half-space, while the second defines a closed half-space, as illustrated in Figure 12.5. It is clear from this figure that the intersection of these two regions is empty only when $\nabla f(x)$ and $\nabla c_1(x)$ point in the same direction, that is, when

$$\nabla f(x) = \lambda_1 \nabla c_1(x), \quad \text{for some } \lambda_1 \ge 0. \tag{12.21}$$

Note that the sign of the multiplier is significant here. If (12.10) were satisfied with a negative value of $\lambda_1$, then $\nabla f(x)$ and $\nabla c_1(x)$ would point in opposite directions, and we see from Figure 12.5 that the set of directions that satisfy both (12.13) and (12.19) would make up an entire open half-plane.

Figure 12.5 A direction $d$ that satisfies both (12.13) and (12.19) lies in the intersection of a closed half-plane and an open half-plane.

The optimality conditions for both cases I and II can again be summarized neatly with reference to the Lagrangian function $\mathcal{L}$ defined in (12.16). When no first-order feasible descent direction exists at some point $x^*$, we have that

$$\nabla_x \mathcal{L}(x^*, \lambda_1^*) = 0, \quad \text{for some } \lambda_1^* \ge 0, \tag{12.22}$$

where we also require that

$$\lambda_1^* c_1(x^*) = 0. \tag{12.23}$$

Condition (12.23) is known as a complementarity condition; it implies that the Lagrange multiplier $\lambda_1$ can be strictly positive only when the corresponding constraint $c_1$ is active. Conditions of this type play a central role in constrained optimization, as we see in the sections that follow. In case I, we have that $c_1(x^*) > 0$, so (12.23) requires that $\lambda_1^* = 0$. Hence, (12.22) reduces to $\nabla f(x^*) = 0$, as required by (12.20). In case II, (12.23) allows $\lambda_1^*$ to take on a nonnegative value, so (12.22) becomes equivalent to (12.21).
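The two cases can be combined into a single test of conditions (12.22) and (12.23) for problem (12.18). The function below is a hypothetical sketch written for this one problem: its name, the tolerance `tol`, and the use of a least-squares multiplier on the boundary are assumptions made here, not a general-purpose optimality check.

```python
def first_order_check(x, tol=1e-10):
    """Test (12.22)-(12.23) for min x1 + x2 s.t. c1(x) = 2 - x1^2 - x2^2 >= 0.

    Returns (conditions_hold, lam).
    """
    grad_f = (1.0, 1.0)
    c1 = 2.0 - x[0] ** 2 - x[1] ** 2
    grad_c = (-2.0 * x[0], -2.0 * x[1])
    if c1 < -tol:
        raise ValueError("x is infeasible")
    if c1 > tol:
        lam = 0.0  # Case I: complementarity (12.23) forces the multiplier to zero
    else:
        # Case II: constraint active; least-squares multiplier for grad_f = lam * grad_c
        lam = (sum(f * g for f, g in zip(grad_f, grad_c))
               / sum(g * g for g in grad_c))
    resid = max(abs(f - lam * g) for f, g in zip(grad_f, grad_c))
    return (resid <= tol and lam >= 0.0), lam


print(first_order_check((-1.0, -1.0)))  # (True, 0.5): the solution
print(first_order_check((1.0, 1.0)))    # (False, -0.5): multiplier has the wrong sign
print(first_order_check((0.0, 0.0)))    # (False, 0.0): interior point with nonzero gradient
```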

TWO  INEQUALITY  CONSTRAINTS 


□ Example 12.3

Suppose we add an extra constraint to the problem (12.18) to obtain

$$\min \ x_1 + x_2 \quad \text{s.t.} \quad 2 - x_1^2 - x_2^2 \ge 0, \quad x_2 \ge 0, \tag{12.24}$$

for which the feasible region is the half-disk illustrated in Figure 12.6. It is easy to see that the solution lies at $(-\sqrt{2}, 0)^T$, a point at which both constraints are active. By repeating the arguments for the previous examples, we would expect a direction $d$ of first-order feasible descent to satisfy

$$\nabla c_i(x)^T d \ge 0, \quad i \in \mathcal{I} = \{1, 2\}, \qquad \nabla f(x)^T d < 0. \tag{12.25}$$

However, it is clear from Figure 12.6 that no such direction can exist when $x = (-\sqrt{2}, 0)^T$. The conditions $\nabla c_i(x)^T d \ge 0$, $i = 1, 2$, are both satisfied only if $d$ lies in the quadrant defined by $\nabla c_1(x)$ and $\nabla c_2(x)$, but it is clear by inspection that all vectors $d$ in this quadrant satisfy $\nabla f(x)^T d \ge 0$.

Let us see how the Lagrangian and its derivatives behave for the problem (12.24) and the solution point $(-\sqrt{2}, 0)^T$. First, we include an additional term $\lambda_i c_i(x)$ in the Lagrangian for each additional constraint, so the definition of $\mathcal{L}$ becomes

$$\mathcal{L}(x, \lambda) = f(x) - \lambda_1 c_1(x) - \lambda_2 c_2(x),$$

where $\lambda = (\lambda_1, \lambda_2)^T$ is the vector of Lagrange multipliers. The extension of condition (12.22) to this case is

$$\nabla_x \mathcal{L}(x^*, \lambda^*) = 0, \quad \text{for some } \lambda^* \ge 0, \tag{12.26}$$

where the inequality $\lambda^* \ge 0$ means that all components of $\lambda^*$ are required to be nonnegative. By applying the complementarity condition (12.23) to both inequality constraints, we obtain

$$\lambda_1^* c_1(x^*) = 0, \qquad \lambda_2^* c_2(x^*) = 0. \tag{12.27}$$

When $x^* = (-\sqrt{2}, 0)^T$, we have

$$\nabla f(x^*) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \nabla c_1(x^*) = \begin{bmatrix} 2\sqrt{2} \\ 0 \end{bmatrix}, \qquad \nabla c_2(x^*) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$

so that it is easy to verify that $\nabla_x \mathcal{L}(x^*, \lambda^*) = 0$ when we select $\lambda^*$ as follows:

$$\lambda^* = \begin{bmatrix} 1/(2\sqrt{2}) \\ 1 \end{bmatrix}.$$

Note that both components of $\lambda^*$ are positive, so that (12.26) is satisfied.

Figure 12.6 Problem (12.24), illustrating the gradients of the active constraints and objective at the solution.

We  consider  now  some  other  feasible  points  that  are  not  solutions  of  (12.24),  and 
examine  the  properties  of  the  Lagrangian  and  its  gradient  at  these  points. 

For the point $x = (\sqrt{2}, 0)^T$, we again have that both constraints are active (see
Figure 12.7). However, it is easy to identify vectors $d$ that satisfy (12.25): $d = (-1, 0)^T$
is one such vector (there are many others). For this value of $x$ it is easy to verify that the
condition $\nabla_x \mathcal{L}(x, \lambda) = 0$ is satisfied only when $\lambda = (-1/(2\sqrt{2}), 1)^T$. Note that the first
component $\lambda_1$ is negative, so that the conditions (12.26) are not satisfied at this point.

Finally, we consider the point $x = (1, 0)^T$, at which only the second constraint $c_2$ is
active. Since any small step $s$ away from this point will continue to satisfy $c_1(x + s) > 0$, we
need to consider only the behavior of $c_2$ and $f$ in determining whether $s$ is indeed a feasible
descent step. Using the same reasoning as in the earlier examples, we find that the direction
of feasible descent $d$ must satisfy

\[ \nabla c_2(x)^T d \ge 0, \qquad \nabla f(x)^T d < 0. \tag{12.28} \]

By noting that

\[ \nabla f(x) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
   \nabla c_2(x) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \]

it is easy to verify that the vector $d = (-\tfrac{1}{2}, \tfrac{1}{4})^T$ satisfies (12.28) and is therefore a
descent direction.

Figure 12.7  Problem (12.24), illustrating the gradients of the active constraints and
objective at a nonoptimal point.

To show that optimality conditions (12.26) and (12.27) fail, we note first from (12.27)
that since $c_1(x) > 0$, we must have $\lambda_1 = 0$. Therefore, in trying to satisfy $\nabla_x \mathcal{L}(x, \lambda) = 0$,
we are left to search for a value $\lambda_2$ such that $\nabla f(x) - \lambda_2 \nabla c_2(x) = 0$. No such $\lambda_2$ exists, and
thus this point fails to satisfy the optimality conditions.


12.2  TANGENT CONE AND CONSTRAINT QUALIFICATIONS


In this section we define the tangent cone $T_\Omega(x^*)$ to the closed convex set $\Omega$ at a point
$x^* \in \Omega$, and also the set $\mathcal{F}(x^*)$ of first-order feasible directions at $x^*$. We also discuss
constraint qualifications. In the previous section, we determined whether or not it was
possible to take a feasible descent step away from a given feasible point $x$ by examining
the first derivatives of $f$ and the constraint functions $c_i$. We used the first-order Taylor
series expansion of these functions about $x$ to form an approximate problem in which both
objective and constraints are linear. This approach makes sense, however, only when the
linearized approximation captures the essential geometric features of the feasible set near
the point $x$ in question. If, near $x$, the linearization is fundamentally different from the
feasible set (for instance, it is an entire plane, while the feasible set is a single point) then
we cannot expect the linear approximation to yield useful information about the original
problem. Hence, we need to make assumptions about the nature of the constraints $c_i$ that
are active at $x$ to ensure that the linearized approximation is similar to the feasible set, near
$x$. Constraint qualifications are assumptions that ensure similarity of the constraint set $\Omega$
and its linearized approximation, in a neighborhood of $x^*$.

Given a feasible point $x$, we call $\{z_k\}$ a feasible sequence approaching $x$ if $z_k \in \Omega$ for all
$k$ sufficiently large and $z_k \to x$.

Later, we characterize a local solution of (12.1) as a point $x$ at which all feasible
sequences approaching $x$ have the property that $f(z_k) \ge f(x)$ for all $k$ sufficiently large,
and we will derive practical, verifiable conditions under which this property holds. We lay
the groundwork in this section by characterizing the directions in which we can step away
from $x$ while remaining feasible.

A  tangent  is  a  limiting  direction  of  a  feasible  sequence. 

Definition  12.2. 

The vector $d$ is said to be a tangent (or tangent vector) to $\Omega$ at a point $x$ if there are a
feasible sequence $\{z_k\}$ approaching $x$ and a sequence of positive scalars $\{t_k\}$ with $t_k \to 0$ such
that

\[ \lim_{k \to \infty} \frac{z_k - x}{t_k} = d. \tag{12.29} \]

The set of all tangents to $\Omega$ at $x^*$ is called the tangent cone and is denoted by $T_\Omega(x^*)$.

It is easy to see that the tangent cone is indeed a cone, according to the definition
(A.36). If $d$ is a tangent vector with corresponding sequences $\{z_k\}$ and $\{t_k\}$, then by replacing
each $t_k$ by $\alpha^{-1} t_k$, for any $\alpha > 0$, we find that $\alpha d \in T_\Omega(x^*)$ also. We obtain that $0 \in T_\Omega(x)$
by setting $z_k \equiv x$ in the definition of feasible sequence.

We  turn  now  to  the  linearized  feasible  direction  set,  which  we  define  as  follows. 

Definition  12.3. 

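Given a feasible point $x$ and the active constraint set $\mathcal{A}(x)$ of Definition 12.1, the set of
linearized feasible directions $\mathcal{F}(x)$ is

\[ \mathcal{F}(x) = \left\{ d \;\middle|\;
   \begin{array}{ll}
   d^T \nabla c_i(x) = 0, & \text{for all } i \in \mathcal{E}, \\
   d^T \nabla c_i(x) \ge 0, & \text{for all } i \in \mathcal{A}(x) \cap \mathcal{I}
   \end{array} \right\}. \]

As with the tangent cone, it is easy to verify that $\mathcal{F}(x)$ is a cone, according to the definition
(A.36).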

It is important to note that the definition of tangent cone does not rely on the algebraic
specification of the set $\Omega$, only on its geometry. The linearized feasible direction set does,
however, depend on the definition of the constraint functions $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$.

Figure 12.8  Constraint normal, objective gradient, and feasible sequence for problem (12.9).


We  illustrate  the  tangent  cone  and  the  linearized  feasible  direction  set  by  revisiting 
Examples  12.1  and  12.2. 


□  Example  12.4  (Example  12.1,  Revisited) 

Figure 12.8 shows the problem (12.9), the equality-constrained problem in which the
feasible set is a circle of radius $\sqrt{2}$, near the nonoptimal point $x = (-\sqrt{2}, 0)^T$. The figure
also shows a feasible sequence approaching $x$. This sequence could be defined analytically
by the formula

\[ z_k = \begin{bmatrix} -\sqrt{2 - 1/k^2} \\ -1/k \end{bmatrix}. \tag{12.30} \]

By choosing $t_k = \|z_k - x\|$, we find that $d = (0, -1)^T$ is a tangent. Note that the objective
function $f(x) = x_1 + x_2$ increases as we move along the sequence (12.30); in fact, we have
$f(z_{k+1}) > f(z_k)$ for all $k = 2, 3, \ldots$. It follows that $f(z_k) < f(x)$ for $k = 2, 3, \ldots$, so $x$
cannot be a solution of (12.9).
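The claims about the sequence (12.30) are easy to confirm numerically. The sketch below evaluates a few elements $z_k$, checks that they lie on the circle $x_1^2 + x_2^2 = 2$, and computes the normalized differences $(z_k - x)/\|z_k - x\|$, which should approach $(0, -1)^T$. The sampled values of $k$ are arbitrary choices made for illustration.

```python
import numpy as np

x = np.array([-np.sqrt(2.0), 0.0])           # the nonoptimal point under study

def z(k):
    """Element z_k of the feasible sequence (12.30)."""
    return np.array([-np.sqrt(2.0 - 1.0 / k**2), -1.0 / k])

def f(p):
    """Objective f(x) = x1 + x2."""
    return p[0] + p[1]

ks = [2, 10, 100, 1000, 10000]
points = [z(k) for k in ks]

# Feasibility: each point should satisfy x1^2 + x2^2 = 2 up to rounding.
on_circle = [abs(p @ p - 2.0) for p in points]

# With t_k = ||z_k - x||, the quotient (z_k - x)/t_k is a unit vector.
directions = [(p - x) / np.linalg.norm(p - x) for p in points]

# Objective values along the sequence.
values = [f(p) for p in points]

for k, dvec, val in zip(ks, directions, values):
    print(k, dvec, val)
```

The printed directions drift toward $(0, -1)^T$, and every objective value stays below $f(x) = -\sqrt{2}$ while increasing with $k$, in line with the argument above.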

Another feasible sequence is one that approaches $x = (-\sqrt{2}, 0)^T$ from the opposite
direction. Its elements are defined by

\[ z_k = \begin{bmatrix} -\sqrt{2 - 1/k^2} \\ 1/k \end{bmatrix}. \]

It is easy to show that $f$ decreases along this sequence and that the tangents corresponding
to this sequence are $d = (0, \alpha)^T$ with $\alpha > 0$. In summary, the tangent cone at $x = (-\sqrt{2}, 0)^T$ is
$\{(0, d_2)^T \mid d_2 \in \mathbb{R}\}$.

For the definition (12.9) of this set, and Definition 12.3, we have that $d = (d_1, d_2)^T \in
\mathcal{F}(x)$ if

\[ 0 = \nabla c_1(x)^T d
     = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}
     = -2\sqrt{2}\, d_1. \]

Therefore, we obtain $\mathcal{F}(x) = \{(0, d_2)^T \mid d_2 \in \mathbb{R}\}$. In this case, we have $T_\Omega(x) = \mathcal{F}(x)$.
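Suppose that the feasible set is defined instead by the formula

\[ \Omega = \{x \mid c_1(x) = 0\}, \quad \text{where } c_1(x) = (x_1^2 + x_2^2 - 2)^2 = 0. \tag{12.31} \]

(Note that $\Omega$ is the same, but its algebraic specification has changed.) The vector $d$ belongs
to the linearized feasible set if

\[ 0 = \nabla c_1(x)^T d
     = \begin{bmatrix} 4(x_1^2 + x_2^2 - 2)x_1 \\ 4(x_1^2 + x_2^2 - 2)x_2 \end{bmatrix}^T
       \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}
     = \begin{bmatrix} 0 \\ 0 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}, \]

which is true for all $(d_1, d_2)^T$. Hence, we have $\mathcal{F}(x) = \mathbb{R}^2$, so for this algebraic specification
of $\Omega$, the tangent cone and linearized feasible sets differ.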
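The dependence of the linearized set on the algebraic description can be seen in a few lines of code. This sketch compares the constraint gradient for the original specification $c(x) = x_1^2 + x_2^2 - 2$ with the gradient of its square, as in (12.31), at $x = (-\sqrt{2}, 0)^T$. The function names and the tolerance are illustrative choices, not part of the text.

```python
import numpy as np

def grad_c(x):
    # c(x) = x1^2 + x2^2 - 2, the specification used in (12.9)
    return 2.0 * x

def grad_c_squared(x):
    # c(x)^2, the specification used in (12.31); chain rule gives 2*c(x)*grad c(x)
    c = x @ x - 2.0
    return 2.0 * c * grad_c(x)

def in_linearized_set(grad, d, tol=1e-9):
    """Equality-constraint test of Definition 12.3: grad^T d = 0 (up to tol)."""
    return abs(grad @ d) <= tol

x = np.array([-np.sqrt(2.0), 0.0])
d_tangent = np.array([0.0, 1.0])             # tangent to the circle at x
d_normal = np.array([1.0, 0.0])              # points into the disk, not tangent

original = (in_linearized_set(grad_c(x), d_tangent),
            in_linearized_set(grad_c(x), d_normal))
squared = (in_linearized_set(grad_c_squared(x), d_tangent),
           in_linearized_set(grad_c_squared(x), d_normal))

print("original specification:", original)
print("squared specification: ", squared)
```

With the original specification only the tangent direction passes the test. With the squared specification the gradient vanishes at $x$, so every direction passes, even though the feasible set itself has not changed.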


□  Example  12.5  (Example  12.2,  Revisited) 

We now reconsider problem (12.18) in Example 12.2. The solution $x = (-1, -1)^T$ is
the same as in the equality-constrained case, but there is a much more extensive collection
of feasible sequences that converge to any given feasible point (see Figure 12.9).


Figure 12.9  Feasible sequences converging to a particular feasible point for the region
defined by $x_1^2 + x_2^2 \le 2$.

From the point $x = (-\sqrt{2}, 0)^T$, the various feasible sequences defined above for the
equality-constrained problem are still feasible for (12.18). There are also infinitely many
feasible sequences that converge to $x = (-\sqrt{2}, 0)^T$ along a straight line from the interior of
the circle. These sequences have the form

\[ z_k = (-\sqrt{2}, 0)^T + (1/k)\, w, \]

where $w$ is any vector whose first component is positive ($w_1 > 0$). The point $z_k$ is feasible
provided that $\|z_k\| \le \sqrt{2}$, that is,

\[ (-\sqrt{2} + w_1/k)^2 + (w_2/k)^2 \le 2, \]

which is true when $k \ge (w_1^2 + w_2^2)/(2\sqrt{2}\, w_1)$. In addition to these straight-line feasible
sequences, we can also define an infinite variety of sequences that approach $(-\sqrt{2}, 0)^T$
along a curve from the interior of the circle. To summarize, the tangent cone to this set at
$(-\sqrt{2}, 0)^T$ is $\{(w_1, w_2)^T \mid w_1 \ge 0\}$.

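For the definition (12.18) of this feasible set, we have from Definition 12.3 that
$d \in \mathcal{F}(x)$ if

\[ 0 \le \nabla c_1(x)^T d
     = \begin{bmatrix} -2x_1 \\ -2x_2 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}
     = 2\sqrt{2}\, d_1. \]

Hence, we obtain $\mathcal{F}(x) = T_\Omega(x)$ for this particular algebraic specification of the feasible
set.

□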


Constraint qualifications are conditions under which the linearized feasible set $\mathcal{F}(x)$
is similar to the tangent cone $T_\Omega(x)$. In fact, most constraint qualifications ensure that these
two sets are identical. As mentioned earlier, these conditions ensure that the set $\mathcal{F}(x)$, which is
constructed by linearizing the algebraic description of the set $\Omega$ at $x$, captures the essential
geometric features of the set $\Omega$ in the vicinity of $x$, as represented by $T_\Omega(x)$.

Revisiting Example 12.4, we see that both $T_\Omega(x)$ and $\mathcal{F}(x)$ consist of the vertical axis,
which is qualitatively similar to the set $\Omega - \{x\}$ in the neighborhood of $x$. As a further
example, consider the constraints

\[ c_1(x) = 1 - x_1^2 - (x_2 - 1)^2 \ge 0, \qquad c_2(x) = -x_2 \ge 0, \tag{12.32} \]

for which the feasible set is the single point $\Omega = \{(0, 0)^T\}$ (see Figure 12.10). For this point
$x = (0, 0)^T$, it is obvious that the tangent cone is $T_\Omega(x) = \{(0, 0)^T\}$, since all feasible
sequences approaching $x$ must have $z_k = x = (0, 0)^T$ for all $k$ sufficiently large. Moreover,
it is easy to show that the linearized approximation $\mathcal{F}(x)$ to the feasible set is

\[ \mathcal{F}(x) = \{(d_1, 0)^T \mid d_1 \in \mathbb{R}\}, \]

that is, the entire horizontal axis. In this case, the linearized feasible direction set does not
capture the geometry of the feasible set, so constraint qualifications are not satisfied.

Figure 12.10  Problem (12.32), for which the feasible set is the single point of
intersection between circle and line.

The  constraint  qualification  most  often  used  in  the  design  of  algorithms  is  the  subject 
of  the  next  definition. 

Definition  12.4  (LICQ). 

Given the point $x$ and the active set $\mathcal{A}(x)$ defined in Definition 12.1, we say that the linear
independence constraint qualification (LICQ) holds if the set of active constraint gradients
$\{\nabla c_i(x),\ i \in \mathcal{A}(x)\}$ is linearly independent.

Note that this condition is not satisfied for the examples (12.32) and (12.31). In general, if
LICQ holds, none of the active constraint gradients can be zero. We mention other constraint
qualifications in Section 12.6.
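LICQ amounts to a rank test on the matrix of active constraint gradients, which can be sketched as follows. The function `licq_holds` and its tolerance are hypothetical helpers introduced here for illustration; the gradients are those of (12.32) at $(0, 0)^T$, of (12.31) at $(-\sqrt{2}, 0)^T$, and of the single constraint in (12.9) at $(-\sqrt{2}, 0)^T$.

```python
import numpy as np

def licq_holds(active_grads, tol=1e-10):
    """Return True if the active constraint gradients are linearly independent."""
    if len(active_grads) == 0:
        return True                          # no active constraints: nothing to check
    A = np.vstack(active_grads)              # one gradient per row
    return np.linalg.matrix_rank(A, tol=tol) == A.shape[0]

# (12.32) at x = (0, 0): c1 = 1 - x1^2 - (x2 - 1)^2, c2 = -x2, both active.
x = np.array([0.0, 0.0])
grads_12_32 = [np.array([-2.0 * x[0], -2.0 * (x[1] - 1.0)]),   # = (0, 2)
               np.array([0.0, -1.0])]

# (12.31) at x = (-sqrt(2), 0): the gradient of the squared constraint is zero there.
grads_12_31 = [np.array([0.0, 0.0])]

# (12.9) at x = (-sqrt(2), 0): c1 = x1^2 + x2^2 - 2 has gradient (2*x1, 2*x2).
grads_12_9 = [np.array([-2.0 * np.sqrt(2.0), 0.0])]

print("LICQ for (12.32):", licq_holds(grads_12_32))
print("LICQ for (12.31):", licq_holds(grads_12_31))
print("LICQ for (12.9): ", licq_holds(grads_12_9))
```

The two gradients of (12.32) are parallel and the gradient of (12.31) vanishes, so LICQ fails in both cases, whereas the single nonzero gradient of (12.9) passes the test.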


12.3  FIRST-ORDER OPTIMALITY CONDITIONS

In this section, we state first-order necessary conditions for $x^*$ to be a local minimizer
and show how these conditions are satisfied on a small example. The proof of the result is
presented in subsequent sections.

As a preliminary to stating the necessary conditions, we define the Lagrangian function
for the general problem (12.1).

\[ \mathcal{L}(x, \lambda) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i c_i(x). \tag{12.33} \]

(We had previously defined special cases of this function for the examples of Section 12.1.)

The necessary conditions defined in the following theorem are called first-order conditions
because they are concerned with properties of the gradients (first-derivative vectors)
of the objective and constraint functions. These conditions are the foundation for many of
the algorithms described in the remaining chapters of the book.

Theorem  12.1  (First-Order  Necessary  Conditions). 

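Suppose that $x^*$ is a local solution of (12.1), that the functions $f$ and $c_i$ in (12.1) are
continuously differentiable, and that the LICQ holds at $x^*$. Then there is a Lagrange multiplier
vector $\lambda^*$, with components $\lambda_i^*$, $i \in \mathcal{E} \cup \mathcal{I}$, such that the following conditions are satisfied at
$(x^*, \lambda^*)$:

\begin{align}
\nabla_x \mathcal{L}(x^*, \lambda^*) &= 0, & & \tag{12.34a} \\
c_i(x^*) &= 0, & &\text{for all } i \in \mathcal{E}, \tag{12.34b} \\
c_i(x^*) &\ge 0, & &\text{for all } i \in \mathcal{I}, \tag{12.34c} \\
\lambda_i^* &\ge 0, & &\text{for all } i \in \mathcal{I}, \tag{12.34d} \\
\lambda_i^* c_i(x^*) &= 0, & &\text{for all } i \in \mathcal{E} \cup \mathcal{I}. \tag{12.34e}
\end{align}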

The conditions (12.34) are often known as the Karush-Kuhn-Tucker conditions, or
KKT conditions for short. The conditions (12.34e) are complementarity conditions; they
imply that either constraint $i$ is active or $\lambda_i^* = 0$, or possibly both. In particular, since the
Lagrange multipliers corresponding to inactive inequality constraints are zero, we can omit
the terms for indices $i \notin \mathcal{A}(x^*)$ from (12.34a) and rewrite this condition as

\[ 0 = \nabla_x \mathcal{L}(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* \nabla c_i(x^*). \tag{12.35} \]


A  special  case  of  complementarity  is  important  and  deserves  its  own  definition. 

Definition  12.5  (Strict  Complementarity). 

Given a local solution $x^*$ of (12.1) and a vector $\lambda^*$ satisfying (12.34), we say that the
strict complementarity condition holds if exactly one of $\lambda_i^*$ and $c_i(x^*)$ is zero for each index
$i \in \mathcal{I}$. In other words, we have that $\lambda_i^* > 0$ for each $i \in \mathcal{I} \cap \mathcal{A}(x^*)$.

Satisfaction of the strict complementarity property usually makes it easier for algorithms to
determine the active set $\mathcal{A}(x^*)$ and converge rapidly to the solution $x^*$.

For a given problem (12.1) and solution point $x^*$, there may be many vectors $\lambda^*$ for
which the conditions (12.34) are satisfied. When the LICQ holds, however, the optimal $\lambda^*$
is unique (see Exercise 12.17).

The proof of Theorem 12.1 is quite complex, but it is important to our understanding
of constrained optimization, so we present it in the next section. First, we illustrate the KKT
conditions with another example.

Figure 12.11  Inequality-constrained problem (12.36) with solution at $(1, 0)^T$.

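□  Example 12.6

Consider the feasible region illustrated in Figure 12.2 and described by the four
constraints (12.6). By restating the constraints in the standard form of (12.1) and including
an objective function, the problem becomes

\[ \min_x \; \left(x_1 - \tfrac{3}{2}\right)^2 + \left(x_2 - \tfrac{1}{2}\right)^4
   \quad \text{s.t.} \quad
   \begin{bmatrix} 1 - x_1 - x_2 \\ 1 - x_1 + x_2 \\ 1 + x_1 - x_2 \\ 1 + x_1 + x_2 \end{bmatrix} \ge 0. \tag{12.36} \]

It is fairly clear from Figure 12.11 that the solution is $x^* = (1, 0)^T$. The first and second
constraints in (12.36) are active at this point. Denoting them by $c_1$ and $c_2$ (and the inactive
constraints by $c_3$ and $c_4$), we have

\[ \nabla f(x^*) = \begin{bmatrix} -1 \\ -\tfrac{1}{2} \end{bmatrix}, \qquad
   \nabla c_1(x^*) = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \qquad
   \nabla c_2(x^*) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \]

Therefore, the KKT conditions (12.34a)-(12.34e) are satisfied when we set

\[ \lambda^* = \left(\tfrac{3}{4}, \tfrac{1}{4}, 0, 0\right)^T. \]

□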
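The KKT conditions for this example can be checked by direct evaluation. The following sketch assumes the objective $(x_1 - 3/2)^2 + (x_2 - 1/2)^4$, which reproduces the gradient $(-1, -1/2)^T$ at $(1, 0)^T$, and the multiplier vector $(3/4, 1/4, 0, 0)^T$ obtained by solving $\nabla f(x^*) = \lambda_1 \nabla c_1(x^*) + \lambda_2 \nabla c_2(x^*)$. The function `kkt_residuals` is a hypothetical helper that reports how far each of the conditions (12.34a)-(12.34e) is from holding.

```python
import numpy as np

def f_grad(x):
    # Gradient of the assumed objective (x1 - 3/2)^2 + (x2 - 1/2)^4
    return np.array([2.0 * (x[0] - 1.5), 4.0 * (x[1] - 0.5) ** 3])

def constraints(x):
    # The four inequality constraints of (12.36), each required to be >= 0
    return np.array([1 - x[0] - x[1],
                     1 - x[0] + x[1],
                     1 + x[0] - x[1],
                     1 + x[0] + x[1]])

# Rows are the (constant) gradients of c1, ..., c4.
CONSTRAINT_GRADS = np.array([[-1.0, -1.0],
                             [-1.0,  1.0],
                             [ 1.0, -1.0],
                             [ 1.0,  1.0]])

def kkt_residuals(x, lam):
    """Measure the violation of each KKT condition at the pair (x, lam)."""
    c = constraints(x)
    stationarity = f_grad(x) - CONSTRAINT_GRADS.T @ lam
    return {
        "stationarity": float(np.linalg.norm(stationarity)),   # (12.34a)
        "primal_violation": float(max(0.0, -c.min())),         # (12.34c)
        "dual_violation": float(max(0.0, -lam.min())),         # (12.34d)
        "complementarity": float(np.abs(lam * c).max()),       # (12.34e)
    }

x_star = np.array([1.0, 0.0])
lam_star = np.array([0.75, 0.25, 0.0, 0.0])
res = kkt_residuals(x_star, lam_star)
print(res)
```

All four residuals evaluate to zero at $(x^*, \lambda^*)$. Since both active multipliers are strictly positive, strict complementarity also holds at this solution.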


12.4  FIRST-ORDER OPTIMALITY CONDITIONS: PROOF


We  now  develop  a  proof  of  Theorem  12.1.  A  number  of  key  subsidiary  results  are  required, 
so  the  development  is  quite  long.  However,  a  complete  treatment  is  worthwhile,  since  these 
results  are  so  fundamental  to  the  field  of  optimization. 


RELATING  THE  TANGENT  CONE  AND  THE  FIRST-ORDER  FEASIBLE 
DIRECTION  SET 

The following key result uses a constraint qualification (LICQ) to relate the tangent
cone of Definition 12.2 to the set $\mathcal{F}$ of first-order feasible directions of Definition 12.3. In
the proof below and in later results, we use the notation $A(x^*)$ to represent the matrix whose
rows are the active constraint gradients at the optimal point, that is,

\[ A(x^*)^T = [\nabla c_i(x^*)]_{i \in \mathcal{A}(x^*)}, \tag{12.37} \]

where the active set $\mathcal{A}(x^*)$ is defined as in Definition 12.1.


Lemma  12.2. 

Let  x*  be  a  feasible  point.  The  following  two  statements  are  true. 

(i)  $T_\Omega(x^*) \subset \mathcal{F}(x^*)$.

(ii)  If the LICQ condition is satisfied at $x^*$, then $\mathcal{F}(x^*) = T_\Omega(x^*)$.


Proof.  Without loss of generality, let us assume that all the constraints $c_i(\cdot)$, $i =
1, 2, \ldots, m$, are active at $x^*$. (We can arrive at this convenient ordering by simply dropping all
inactive constraints, which are irrelevant in some neighborhood of $x^*$, and renumbering
the active constraints that remain.)

To prove (i), let $\{z_k\}$ and $\{t_k\}$ be the sequences for which (12.29) is satisfied, that is,

\[ \lim_{k \to \infty} \frac{z_k - x^*}{t_k} = d. \]

(Note in particular that $t_k > 0$ for all $k$.) From this definition, we have that

\[ z_k = x^* + t_k d + o(t_k). \tag{12.38} \]

By taking $i \in \mathcal{E}$ and using Taylor's theorem, we have that

\begin{align*}
0 &= \frac{1}{t_k} c_i(z_k) \\
  &= \frac{1}{t_k} \left[ c_i(x^*) + t_k \nabla c_i(x^*)^T d + o(t_k) \right] \\
  &= \nabla c_i(x^*)^T d + \frac{o(t_k)}{t_k}.
\end{align*}

By taking the limit as $k \to \infty$, the last term in this expression vanishes, and we have
$\nabla c_i(x^*)^T d = 0$, as required. For the active inequality constraints $i \in \mathcal{A}(x^*) \cap \mathcal{I}$, we have
similarly that

\begin{align*}
0 &\le \frac{1}{t_k} c_i(z_k) \\
  &= \frac{1}{t_k} \left[ c_i(x^*) + t_k \nabla c_i(x^*)^T d + o(t_k) \right] \\
  &= \nabla c_i(x^*)^T d + \frac{o(t_k)}{t_k}.
\end{align*}

Hence, by a similar limiting argument, we have that $\nabla c_i(x^*)^T d \ge 0$, as required.

For (ii), we use the implicit function theorem (see the Appendix or Lang [187, p. 131]
for a statement of this result). First, since the LICQ holds, we have from Definition 12.4 that
the $m \times n$ matrix $A(x^*)$ of active constraint gradients has full row rank $m$. Let $Z$ be a matrix
whose columns are a basis for the null space of $A(x^*)$; that is,

\[ Z \in \mathbb{R}^{n \times (n-m)}, \quad Z \text{ has full column rank}, \quad A(x^*) Z = 0. \tag{12.39} \]

(See the related discussion in Chapter 16.) Choose $d \in \mathcal{F}(x^*)$ arbitrarily, and suppose that
$\{t_k\}_{k=0}^{\infty}$ is any sequence of positive scalars such that $\lim_{k \to \infty} t_k = 0$. Define the parametrized
system of equations $R : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ by

\[ R(z, t) = \begin{bmatrix} c(z) - t A(x^*) d \\ Z^T (z - x^* - t d) \end{bmatrix}
           = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \tag{12.40} \]


We claim that the solutions $z = z_k$ of this system for small $t = t_k > 0$ give a feasible sequence
that approaches $x^*$ and satisfies the definition (12.29).

At $t = 0$, $z = x^*$, and the Jacobian of $R$ at this point is

\[ \nabla_z R(x^*, 0) = \begin{bmatrix} A(x^*) \\ Z^T \end{bmatrix}, \tag{12.41} \]

which is nonsingular by construction of $Z$. Hence, according to the implicit function
theorem, the system (12.40) has a unique solution $z_k$ for all values of $t_k$ sufficiently small.
Moreover, we have from (12.40) and Definition 12.3 that

\begin{align}
i \in \mathcal{E} &\;\Rightarrow\; c_i(z_k) = t_k \nabla c_i(x^*)^T d = 0, \tag{12.42a} \\
i \in \mathcal{A}(x^*) \cap \mathcal{I} &\;\Rightarrow\; c_i(z_k) = t_k \nabla c_i(x^*)^T d \ge 0, \tag{12.42b}
\end{align}

so that $z_k$ is indeed feasible.



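It remains to verify that (12.29) holds for this choice of $\{z_k\}$. Using the fact that
$R(z_k, t_k) = 0$ for all $k$ together with Taylor's theorem, we find that

\begin{align*}
0 = R(z_k, t_k)
  &= \begin{bmatrix} c(z_k) - t_k A(x^*) d \\ Z^T (z_k - x^* - t_k d) \end{bmatrix} \\
  &= \begin{bmatrix} A(x^*)(z_k - x^*) + o(\|z_k - x^*\|) - t_k A(x^*) d \\ Z^T (z_k - x^* - t_k d) \end{bmatrix} \\
  &= \begin{bmatrix} A(x^*) \\ Z^T \end{bmatrix} (z_k - x^* - t_k d) + o(\|z_k - x^*\|).
\end{align*}

By dividing this expression by $t_k$ and using nonsingularity of the coefficient matrix in the
first term, we obtain

\[ \frac{z_k - x^*}{t_k} = d + o\!\left( \frac{\|z_k - x^*\|}{t_k} \right), \]

from which it follows that (12.29) is satisfied (for $x = x^*$). Hence, $d \in T_\Omega(x^*)$ for an
arbitrary $d \in \mathcal{F}(x^*)$, so the proof of (ii) is complete.  □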
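The construction in part (ii) can be traced numerically on a small case. The sketch below is an illustration under stated assumptions, not part of the proof: it takes the single active constraint $c(z) = 2 - z_1^2 - z_2^2$ at $x^* = (-\sqrt{2}, 0)^T$, picks the direction $d = (1, 1)^T \in \mathcal{F}(x^*)$, and solves the system (12.40) with a plain Newton iteration (the helper `solve_for_z` and the values of $t$ are arbitrary choices).

```python
import numpy as np

x_star = np.array([-np.sqrt(2.0), 0.0])

def c(z):
    # Single active constraint c(z) = 2 - z1^2 - z2^2, returned as a length-1 vector
    return np.array([2.0 - z @ z])

def jac_c(z):
    # 1 x 2 Jacobian of c
    return np.array([[-2.0 * z[0], -2.0 * z[1]]])

A = jac_c(x_star)                 # A(x*) = [2*sqrt(2), 0], full row rank
Z = np.array([[0.0], [1.0]])      # basis for the null space of A(x*)
d = np.array([1.0, 1.0])          # A d = 2*sqrt(2) >= 0, so d lies in F(x*)

def R(z, t):
    """Left-hand side of the parametrized system (12.40)."""
    return np.concatenate([c(z) - t * (A @ d), Z.T @ (z - x_star - t * d)])

def solve_for_z(t, iters=20):
    """Newton's method on R(z, t) = 0, started from the linear guess x* + t d."""
    z = x_star + t * d
    for _ in range(iters):
        J = np.vstack([jac_c(z), Z.T])            # Jacobian of R with respect to z
        z = z - np.linalg.solve(J, R(z, t))
    return z

ts = [1e-1, 1e-2, 1e-3, 1e-4]
zs = [solve_for_z(t) for t in ts]
quotients = [(z - x_star) / t for z, t in zip(zs, ts)]

for t, z, q in zip(ts, zs, quotients):
    print(t, c(z)[0], q)
```

Each solution $z_k$ is feasible, because $c(z_k) = t_k \nabla c(x^*)^T d \ge 0$ as in (12.42b), and the quotients $(z_k - x^*)/t_k$ approach $d$ as $t_k$ shrinks, which is exactly the limit (12.29).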


A  FUNDAMENTAL  NECESSARY  CONDITION 

As mentioned above, a local solution of (12.1) is a point $x$ at which all feasible sequences
have the property that $f(z_k) \ge f(x)$ for all $k$ sufficiently large. The following result shows
that if such a sequence exists, then its limiting directions must make a nonnegative inner
product with the objective function gradient.

Theorem  12.3. 

If $x^*$ is a local solution of (12.1), then we have

\[ \nabla f(x^*)^T d \ge 0, \quad \text{for all } d \in T_\Omega(x^*). \tag{12.43} \]

Proof.  Suppose for contradiction that there is a tangent $d$ for which $\nabla f(x^*)^T d < 0$. Let
$\{z_k\}$ and $\{t_k\}$ be the sequences satisfying Definition 12.2 for this $d$. We have that

\begin{align*}
f(z_k) &= f(x^*) + (z_k - x^*)^T \nabla f(x^*) + o(\|z_k - x^*\|) \\
       &= f(x^*) + t_k d^T \nabla f(x^*) + o(t_k),
\end{align*}

where the second line follows from (12.38). Since $d^T \nabla f(x^*) < 0$, the remainder term is
eventually dominated by the first-order term, that is,

\[ f(z_k) < f(x^*) + \tfrac{1}{2} t_k d^T \nabla f(x^*), \quad \text{for all } k \text{ sufficiently large.} \]

Hence, given any open neighborhood of $x^*$, we can choose $k$ sufficiently large that $z_k$ lies
within this neighborhood and has a lower value of the objective $f$. Therefore, $x^*$ is not a
local solution.  □

Figure 12.12  Problem (12.44), showing various limiting directions of feasible
sequences at the point $(0, 0)^T$.

The converse of this result is not necessarily true. That is, we may have $\nabla f(x^*)^T d \ge 0$
for all $d \in T_\Omega(x^*)$, yet $x^*$ is not a local minimizer. An example is the following problem in
two unknowns, illustrated in Figure 12.12:

\[ \min \; x_2 \quad \text{subject to} \quad x_2 \ge -x_1^2. \tag{12.44} \]

This problem is actually unbounded, but let us examine its behavior at $x^* = (0, 0)^T$. It is
not difficult to show that all limiting directions $d$ of feasible sequences must have $d_2 \ge 0$, so
that $\nabla f(x^*)^T d = d_2 \ge 0$. However, $x^*$ is clearly not a local minimizer; the point $(\alpha, -\alpha^2)^T$
for $\alpha > 0$ has a smaller function value than $x^*$, and can be brought arbitrarily close to $x^*$
by setting $\alpha$ sufficiently small.
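The failure of the converse for problem (12.44) can be confirmed by evaluating a few points on the curve $(\alpha, -\alpha^2)^T$. This is a small illustrative sketch; the sampled values of $\alpha$ and the rounding tolerance in the feasibility test are arbitrary choices.

```python
import numpy as np

def f(x):
    # Objective of (12.44)
    return x[1]

def c(x):
    # Constraint of (12.44) written as c(x) = x2 + x1^2 >= 0
    return x[1] + x[0] ** 2

x_star = np.zeros(2)
alphas = [1e-1, 1e-2, 1e-3]
points = [np.array([a, -a ** 2]) for a in alphas]

# Points on the boundary curve are feasible (up to rounding in the constraint value).
feasible = [c(p) >= -1e-12 for p in points]

# Each point has a strictly smaller objective value than x* = (0, 0).
lower = [f(p) < f(x_star) for p in points]

# Distances to x* shrink with alpha, so such points exist in every neighborhood.
dist = [float(np.linalg.norm(p - x_star)) for p in points]

print(feasible, lower, dist)
```

Every sampled point is feasible, lies closer to $x^*$ as $\alpha$ decreases, and has a lower objective value, even though no tangent direction at $x^*$ is a first-order descent direction.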


FARKAS’  LEMMA 

The  most  important  step  in  proving  Theorem  12.1  is  a  classical  theorem  of  the 
alternative  known  as  Farkas’  Lemma.  This  lemma  considers  a  cone  K  defined  as  follows: 


\[ K = \{By + Cw \mid y \ge 0\}, \tag{12.45} \]

where $B$ and $C$ are matrices of dimension $n \times m$ and $n \times p$, respectively, and $y$ and $w$ are
vectors of appropriate dimensions. Given a vector $g \in \mathbb{R}^n$, Farkas' Lemma states that one
(and only one) of two alternatives is true. Either $g \in K$, or else there is a vector $d \in \mathbb{R}^n$ such
that

\[ g^T d < 0, \qquad B^T d \ge 0, \qquad C^T d = 0. \tag{12.46} \]

The two cases are illustrated in Figure 12.13 for the case of $B$ with three columns, $C$ null,
and $n = 2$. Note that in the second case, the vector $d$ defines a separating hyperplane, which
is a plane in $\mathbb{R}^n$ that separates the vector $g$ from the cone $K$.

Lemma  12.4  (Farkas). 

Let the cone $K$ be defined as in (12.45). Given any vector $g \in \mathbb{R}^n$, we have either that
$g \in K$ or that there exists $d \in \mathbb{R}^n$ satisfying (12.46), but not both.

Proof.  We show first that the two alternatives cannot hold simultaneously. If $g \in K$, there
exist vectors $y \ge 0$ and $w$ such that $g = By + Cw$. If there also exists a $d$ with the property
(12.46), we have by taking inner products that

\[ 0 > d^T g = d^T B y + d^T C w = (B^T d)^T y + (C^T d)^T w \ge 0, \]

where the final inequality follows from $C^T d = 0$, $B^T d \ge 0$, and $y \ge 0$. Hence, we cannot
have both alternatives holding at once.

We now show that one of the alternatives holds. To be precise, we show how to
construct $d$ with the properties (12.46) in the case that $g \notin K$. For this part of the proof,
we need to use the property that $K$ is a closed set, a fact that is intuitively obvious but not
trivial to prove (see Lemma 12.15 in the Notes and References below). Let $\hat{s}$ be the vector
in $K$ that is closest to $g$ in the sense of the Euclidean norm. Because $K$ is closed, $\hat{s}$ is well
defined and is given by the solution of the following optimization problem:

\[ \min \; \|s - g\|_2^2 \quad \text{subject to} \quad s \in K. \tag{12.47} \]

Since $\hat{s} \in K$, we have from the fact that $K$ is a cone that $\alpha \hat{s} \in K$ for all scalars $\alpha \ge 0$. Since
$\|\alpha \hat{s} - g\|_2^2$ is minimized by $\alpha = 1$, we have by simple calculus that

\begin{align}
\left. \frac{d}{d\alpha} \|\alpha \hat{s} - g\|_2^2 \right|_{\alpha = 1} = 0
  &\;\Rightarrow\; \left. \left( -2 \hat{s}^T g + 2 \alpha \hat{s}^T \hat{s} \right) \right|_{\alpha = 1} = 0 \notag \\
  &\;\Rightarrow\; \hat{s}^T (\hat{s} - g) = 0. \tag{12.48}
\end{align}

Now, let $s$ be any other vector in $K$. Since $K$ is convex, we have by the minimizing property
of $\hat{s}$ that

\[ \|\hat{s} + \theta (s - \hat{s}) - g\|_2^2 \ge \|\hat{s} - g\|_2^2 \quad \text{for all } \theta \in [0, 1], \]

and hence

\[ 2\theta (s - \hat{s})^T (\hat{s} - g) + \theta^2 \|s - \hat{s}\|_2^2 \ge 0. \]

By dividing this expression by $\theta$ and taking the limit as $\theta \downarrow 0$, we have $(s - \hat{s})^T (\hat{s} - g) \ge 0$.
Therefore, because of (12.48),

\[ s^T (\hat{s} - g) \ge 0, \quad \text{for all } s \in K. \tag{12.49} \]

We claim now that the vector

\[ d = \hat{s} - g \]

satisfies the conditions (12.46). Note that $d \ne 0$ because $g \notin K$. We have from (12.48) that

\[ d^T g = d^T (\hat{s} - d) = (\hat{s} - g)^T \hat{s} - d^T d = -\|d\|_2^2 < 0, \]

so that $d$ satisfies the first property in (12.46).

From (12.49), we have that $d^T s \ge 0$ for all $s \in K$, so that

\[ d^T (By + Cw) \ge 0 \quad \text{for all } y \ge 0 \text{ and all } w. \]

By fixing $y = 0$ we have that $(C^T d)^T w \ge 0$ for all $w$, which is true only if $C^T d = 0$. By
fixing $w = 0$, we have that $(B^T d)^T y \ge 0$ for all $y \ge 0$, which is true only if $B^T d \ge 0$. Hence,
$d$ also satisfies the second and third properties in (12.46) and our proof is complete.  □

By applying Lemma 12.4 to the cone $N$ defined by

\[ N = \left\{ \sum_{i \in \mathcal{A}(x^*)} \lambda_i \nabla c_i(x^*), \quad \lambda_i \ge 0 \text{ for } i \in \mathcal{A}(x^*) \cap \mathcal{I} \right\} \tag{12.50} \]

and setting $g = \nabla f(x^*)$, we have that either

\[ \nabla f(x^*) = \sum_{i \in \mathcal{A}(x^*)} \lambda_i \nabla c_i(x^*) = A(x^*)^T \lambda, \quad \lambda_i \ge 0 \text{ for } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \tag{12.51} \]

or else there is a direction $d$ such that $d^T \nabla f(x^*) < 0$ and $d \in \mathcal{F}(x^*)$.
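The projection argument in the proof of Lemma 12.4 suggests a simple numerical experiment. The sketch below treats the case in which $C$ is null, so that $K = \{By \mid y \ge 0\}$, and approximates the closest point $\hat{s}$ in (12.47) by projected gradient descent on $\|By - g\|_2^2$ over $y \ge 0$. This iteration is a stand-in chosen for brevity, not a method taken from the text, and the names `project_onto_cone` and `farkas_alternative`, the matrix $B$, and the tolerances are all illustrative assumptions.

```python
import numpy as np

def project_onto_cone(B, g, iters=5000):
    """Approximate the point of K = {B y : y >= 0} closest to g."""
    # Step 1/L, where L = 2*||B||_2^2 bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)
    y = np.zeros(B.shape[1])
    for _ in range(iters):
        grad = 2.0 * B.T @ (B @ y - g)        # gradient of ||B y - g||^2
        y = np.maximum(y - step * grad, 0.0)  # project back onto y >= 0
    return B @ y, y

def farkas_alternative(B, g, tol=1e-8):
    """Return ('g in K', y) or ('separating d', d) following the proof's construction."""
    s_hat, y = project_onto_cone(B, g)
    d = s_hat - g
    if np.linalg.norm(d) <= tol:
        return "g in K", y
    return "separating d", d

# Columns of B are the cone generators (1, 0) and (1, 1).
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])

g_in = np.array([2.0, 1.0])       # equals 1*(1,0) + 1*(1,1), so it lies in K
g_out = np.array([0.0, 1.0])      # lies outside the cone

case_in, y = farkas_alternative(B, g_in)
case_out, d = farkas_alternative(B, g_out)

print(case_in, y)
print(case_out, d, "g^T d =", g_out @ d, "B^T d =", B.T @ d)
```

For the vector inside the cone the residual $\hat{s} - g$ vanishes and nonnegative coefficients $y$ are recovered. For the vector outside the cone, $d = \hat{s} - g$ satisfies $g^T d < 0$ and $B^T d \ge 0$ up to the iteration tolerance, matching the two alternatives of the lemma.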


PROOF  OF  THEOREM  12.1 


Lemmas 12.2 and 12.4 can be combined to give the KKT conditions described in
Theorem 12.1. We work through the final steps of the proof here. Suppose that $x^* \in \mathbb{R}^n$ is a
feasible point at which the LICQ holds. The theorem claims that if $x^*$ is a local solution for
(12.1), then there is a vector $\lambda^* \in \mathbb{R}^m$ that satisfies the conditions (12.34).

We show first that there are multipliers $\lambda_i$, $i \in \mathcal{A}(x^*)$, such that (12.51) is satisfied.
Theorem 12.3 tells us that $d^T \nabla f(x^*) \ge 0$ for all tangent vectors $d \in T_\Omega(x^*)$. From
Lemma 12.2, since LICQ holds, we have that $T_\Omega(x^*) = \mathcal{F}(x^*)$. By putting these two
statements together, we find that $d^T \nabla f(x^*) \ge 0$ for all $d \in \mathcal{F}(x^*)$. Hence, from Lemma 12.4,
there is a vector $\lambda$ for which (12.51) holds, as claimed.

We now define the vector $\lambda^*$ by

\[ \lambda_i^* = \begin{cases} \lambda_i, & i \in \mathcal{A}(x^*), \\ 0, & i \in \mathcal{I} \setminus \mathcal{A}(x^*), \end{cases} \tag{12.52} \]

and show that this choice of $\lambda^*$, together with our local solution $x^*$, satisfies the conditions
(12.34). We check these conditions in turn.

•  The condition (12.34a) follows immediately from (12.51) and the definitions (12.33)
of the Lagrangian function and (12.52) of $\lambda^*$.

•  Since $x^*$ is feasible, the conditions (12.34b) and (12.34c) are satisfied.

•  We have from (12.51) that $\lambda_i^* \ge 0$ for $i \in \mathcal{A}(x^*) \cap \mathcal{I}$, while from (12.52), $\lambda_i^* = 0$ for
$i \in \mathcal{I} \setminus \mathcal{A}(x^*)$. Hence, $\lambda_i^* \ge 0$ for $i \in \mathcal{I}$, so that (12.34d) holds.

•  We have for $i \in \mathcal{A}(x^*) \cap \mathcal{I}$ that $c_i(x^*) = 0$, while for $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$, we have $\lambda_i^* = 0$.
Hence $\lambda_i^* c_i(x^*) = 0$ for $i \in \mathcal{I}$, so that (12.34e) is satisfied as well.


The proof is complete.


12.5  SECOND-ORDER CONDITIONS

So far, we have described first-order conditions (the KKT conditions), which tell us how
the first derivatives of $f$ and the active constraints $c_i$ are related to each other at a solution $x^*$.
When these conditions are satisfied, a move along any vector $w$ from $\mathcal{F}(x^*)$ either increases
the first-order approximation to the objective function (that is, $w^T \nabla f(x^*) > 0$), or else
keeps this value the same (that is, $w^T \nabla f(x^*) = 0$).

What role do the second derivatives of $f$ and the constraints $c_i$ play in optimality
conditions? We see in this section that second derivatives play a "tiebreaking" role. For
the directions $w \in \mathcal{F}(x^*)$ for which $w^T \nabla f(x^*) = 0$, we cannot determine from first
derivative information alone whether a move along this direction will increase or decrease
the objective function $f$. Second-order conditions examine the second derivative terms in
the Taylor series expansions of $f$ and $c_i$, to see whether this extra information resolves the
issue of increase or decrease in $f$. Essentially, the second-order conditions concern the
curvature of the Lagrangian function in the "undecided" directions, that is, the directions
$w \in \mathcal{F}(x^*)$ for which $w^T \nabla f(x^*) = 0$.

Since we are discussing second derivatives, stronger smoothness assumptions are
needed here than in the previous sections. For the purpose of this section, $f$ and $c_i$,
$i \in \mathcal{E} \cup \mathcal{I}$, are all assumed to be twice continuously differentiable.

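Given $\mathcal{F}(x^*)$ from Definition 12.3 and some Lagrange multiplier vector $\lambda^*$ satisfying
the KKT conditions (12.34), we define the critical cone $\mathcal{C}(x^*, \lambda^*)$ as follows:

$$ \mathcal{C}(x^*, \lambda^*) = \{ w \in \mathcal{F}(x^*) \mid \nabla c_i(x^*)^T w = 0, \ \text{all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* > 0 \}. $$

Equivalently,

$$ w \in \mathcal{C}(x^*, \lambda^*) \iff \begin{cases} \nabla c_i(x^*)^T w = 0, & \text{for all } i \in \mathcal{E}, \\ \nabla c_i(x^*)^T w = 0, & \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* > 0, \\ \nabla c_i(x^*)^T w \ge 0, & \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* = 0. \end{cases} \tag{12.53} $$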

The critical cone contains those directions $w$ that would tend to “adhere” to the active
inequality constraints even when we were to make small changes to the objective (those
indices $i \in \mathcal{I}$ for which the Lagrange multiplier component $\lambda_i^*$ is positive), as well as to the
equality constraints. From the definition (12.53) and the fact that $\lambda_i^* = 0$ for all inactive
components $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$, it follows immediately that

$$ w \in \mathcal{C}(x^*, \lambda^*) \ \Rightarrow \ \lambda_i^* \nabla c_i(x^*)^T w = 0 \quad \text{for all } i \in \mathcal{E} \cup \mathcal{I}. \tag{12.54} $$

Hence, from the first KKT condition (12.34a) and the definition (12.33) of the Lagrangian
function, we have that

$$ w \in \mathcal{C}(x^*, \lambda^*) \ \Rightarrow \ w^T \nabla f(x^*) = \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i^* w^T \nabla c_i(x^*) = 0. \tag{12.55} $$
Figure 12.14  Problem (12.56), showing $\mathcal{F}(x^*)$ and $\mathcal{C}(x^*, \lambda^*)$.

Hence the critical cone $\mathcal{C}(x^*, \lambda^*)$ contains directions from $\mathcal{F}(x^*)$ for which it is not clear
from first derivative information alone whether $f$ will increase or decrease.


□  Example 12.7

Consider the problem

$$ \min x_1 \quad \text{subject to} \quad x_2 \ge 0, \quad 1 - (x_1 - 1)^2 - x_2^2 \ge 0, \tag{12.56} $$

illustrated in Figure 12.14. It is not difficult to see that the solution is $x^* = (0, 0)^T$, with
active set $\mathcal{A}(x^*) = \{1, 2\}$ and a unique optimal Lagrange multiplier $\lambda^* = (0, 0.5)^T$. Since
the gradients of the active constraints at $x^*$ are $(0, 1)^T$ and $(2, 0)^T$, respectively, the LICQ
holds, so the optimal multiplier is unique. The linearized feasible set is then

$$ \mathcal{F}(x^*) = \{ d \mid d \ge 0 \}, $$

while the critical cone is

$$ \mathcal{C}(x^*, \lambda^*) = \{ (0, w_2)^T \mid w_2 \ge 0 \}. $$

□
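The claims of Example 12.7 can be checked numerically. The following is a minimal sketch in plain Python, not part of the original text; the helper names (`in_linearized_set`, `in_critical_cone`) are hypothetical, and the membership tests simply transcribe Definition 12.3 and (12.53) for this two-variable problem.

```python
# Data for problem (12.56): min x1  s.t.  x2 >= 0,  1 - (x1 - 1)^2 - x2^2 >= 0.
def grad_f(x):
    return (1.0, 0.0)

def constraint_gradients(x):
    x1, x2 = x
    return [(0.0, 1.0), (-2.0 * (x1 - 1.0), -2.0 * x2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x_star = (0.0, 0.0)
lam_star = (0.0, 0.5)
grads = constraint_gradients(x_star)  # both constraints are active at x*

# Stationarity (12.34a): grad f(x*) - sum_i lam_i * grad c_i(x*) should vanish.
residual = tuple(
    grad_f(x_star)[j] - sum(l * g[j] for l, g in zip(lam_star, grads))
    for j in range(2)
)

def in_linearized_set(w, tol=1e-12):
    """w is in F(x*) if grad c_i(x*)^T w >= 0 for both active inequalities."""
    return all(dot(g, w) >= -tol for g in grads)

def in_critical_cone(w, tol=1e-12):
    """w in F(x*) with grad c_i(x*)^T w = 0 whenever lam_i > 0, as in (12.53)."""
    if not in_linearized_set(w, tol):
        return False
    return all(abs(dot(g, w)) <= tol for l, g in zip(lam_star, grads) if l > 0)
```

For instance, `in_critical_cone((0.0, 1.0))` holds, while `(1.0, 1.0)` lies in the linearized feasible set but not in the critical cone, because the second constraint has a positive multiplier.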
The first theorem defines a necessary condition involving the second derivatives: If
$x^*$ is a local solution, then the Hessian of the Lagrangian has nonnegative curvature along
critical directions (that is, the directions in $\mathcal{C}(x^*, \lambda^*)$).

Theorem 12.5 (Second-Order Necessary Conditions).

Suppose that $x^*$ is a local solution of (12.1) and that the LICQ condition is satisfied. Let
$\lambda^*$ be the Lagrange multiplier vector for which the KKT conditions (12.34) are satisfied. Then

$$ w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w \ge 0, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*). \tag{12.57} $$

PROOF. Since $x^*$ is a local solution, all feasible sequences $\{z_k\}$ approaching $x^*$ must have
$f(z_k) \ge f(x^*)$ for all $k$ sufficiently large. Our approach in this proof is to construct a
feasible sequence whose limiting direction is $w$ and show that the property $f(z_k) \ge f(x^*)$
implies that (12.57) holds.

Since $w \in \mathcal{C}(x^*, \lambda^*) \subset \mathcal{F}(x^*)$, we can use the technique in the proof of Lemma 12.2
to choose a sequence $\{t_k\}$ of positive scalars and to construct a feasible sequence $\{z_k\}$
approaching $x^*$ such that

$$ \lim_{k \to \infty} \frac{z_k - x^*}{t_k} = w, \tag{12.58} $$

which we can write also as

$$ z_k - x^* = t_k w + o(t_k). \tag{12.59} $$

Because of the construction technique for $\{z_k\}$, we have from formula (12.42) that

$$ c_i(z_k) = t_k \nabla c_i(x^*)^T w, \quad \text{for all } i \in \mathcal{A}(x^*). \tag{12.60} $$

From (12.33), (12.60), and (12.54), we have

$$ \begin{aligned} \mathcal{L}(z_k, \lambda^*) &= f(z_k) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i^* c_i(z_k) \\ &= f(z_k) - t_k \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* \nabla c_i(x^*)^T w \\ &= f(z_k). \end{aligned} \tag{12.61} $$

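On the other hand, we can perform a Taylor series expansion to obtain an estimate of
$\mathcal{L}(z_k, \lambda^*)$ near $x^*$. By using Taylor’s theorem expression (2.6) and continuity of the Hessians
$\nabla^2 f$ and $\nabla^2 c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, we obtain

$$ \begin{aligned} \mathcal{L}(z_k, \lambda^*) = {} & \mathcal{L}(x^*, \lambda^*) + (z_k - x^*)^T \nabla_x \mathcal{L}(x^*, \lambda^*) \\ & + \tfrac{1}{2} (z_k - x^*)^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) (z_k - x^*) + o(\|z_k - x^*\|^2). \end{aligned} \tag{12.62} $$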
By the complementarity conditions (12.34e), we have $\mathcal{L}(x^*, \lambda^*) = f(x^*)$. From (12.34a),
the second term on the right-hand side is zero. Hence, using (12.59), we can rewrite (12.62)
as

$$ \mathcal{L}(z_k, \lambda^*) = f(x^*) + \tfrac{1}{2} t_k^2 w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w + o(t_k^2). \tag{12.63} $$

By substituting (12.61) into (12.63), we obtain

$$ f(z_k) = f(x^*) + \tfrac{1}{2} t_k^2 w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w + o(t_k^2). \tag{12.64} $$

If $w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w < 0$, then (12.64) would imply that $f(z_k) < f(x^*)$ for all $k$ sufficiently
large, contradicting the fact that $x^*$ is a local solution. Hence, the condition (12.57) must
hold, as claimed.  □

Sufficient conditions are conditions on $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, that ensure that $x^*$ is a local
solution of the problem (12.1). (They take the opposite tack to necessary conditions, which
assume that $x^*$ is a local solution and deduce properties of $f$ and $c_i$, for the active indices
$i$.) The second-order sufficient condition stated in the next theorem looks very much like
the necessary condition just discussed, but it differs in that the constraint qualification is
not required, and the inequality in (12.57) is replaced by a strict inequality.

Theorem 12.6 (Second-Order Sufficient Conditions).

Suppose that for some feasible point $x^* \in \mathbb{R}^n$ there is a Lagrange multiplier vector $\lambda^*$
such that the KKT conditions (12.34) are satisfied. Suppose also that

$$ w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w > 0, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*), \ w \ne 0. \tag{12.65} $$

Then $x^*$ is a strict local solution for (12.1).

PROOF. First, note that the set $\bar{\mathcal{C}} = \{ d \in \mathcal{C}(x^*, \lambda^*) \mid \|d\| = 1 \}$ is a compact subset of
$\mathcal{C}(x^*, \lambda^*)$, so by (12.65), the minimum of $d^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) d$ over this set is a strictly positive
number, say $\sigma$. Since $\mathcal{C}(x^*, \lambda^*)$ is a cone, we have that $(w / \|w\|) \in \bar{\mathcal{C}}$ if and only if
$w \in \mathcal{C}(x^*, \lambda^*)$, $w \ne 0$. Therefore, condition (12.65) implies that

$$ w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w \ge \sigma \|w\|^2, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*), \tag{12.66} $$

for $\sigma > 0$ defined as above. (Note that this inequality holds trivially for $w = 0$.)

We prove the result by showing that every feasible sequence $\{z_k\}$ approaching $x^*$ has
$f(z_k) \ge f(x^*) + (\sigma/4) \|z_k - x^*\|^2$, for all $k$ sufficiently large. Suppose for contradiction
that this is not the case, and that there is a sequence $\{z_k\}$ approaching $x^*$ with

$$ f(z_k) < f(x^*) + (\sigma/4) \|z_k - x^*\|^2, \quad \text{for all } k \text{ sufficiently large.} \tag{12.67} $$
By taking a subsequence if necessary, we can identify a limiting direction $d$ such that

$$ \lim_{k \to \infty} \frac{z_k - x^*}{\|z_k - x^*\|} = d. \tag{12.68} $$

We have from Lemma 12.2(i) and Definition 12.3 that $d \in \mathcal{F}(x^*)$. From (12.33) and the
facts that $\lambda_i^* \ge 0$ and $c_i(z_k) \ge 0$ for $i \in \mathcal{I}$ and $c_i(z_k) = 0$ for $i \in \mathcal{E}$, we have that

$$ \mathcal{L}(z_k, \lambda^*) = f(z_k) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* c_i(z_k) \le f(z_k), \tag{12.69} $$

while the Taylor series approximation (12.63) from the proof of Theorem 12.5 continues to
hold.

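If $d$ were not in $\mathcal{C}(x^*, \lambda^*)$, we could identify some index $j \in \mathcal{A}(x^*) \cap \mathcal{I}$ such that the
strict positivity condition

$$ \lambda_j^* \nabla c_j(x^*)^T d > 0 \tag{12.70} $$

is satisfied, while for the remaining indices $i \in \mathcal{A}(x^*)$, we have

$$ \lambda_i^* \nabla c_i(x^*)^T d \ge 0. $$

From Taylor’s theorem and (12.68), we have for this particular value of $j$ that

$$ \begin{aligned} \lambda_j^* c_j(z_k) &= \lambda_j^* c_j(x^*) + \lambda_j^* \nabla c_j(x^*)^T (z_k - x^*) + o(\|z_k - x^*\|) \\ &= \|z_k - x^*\| \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). \end{aligned} $$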

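Hence, from (12.69), we have that

$$ \begin{aligned} \mathcal{L}(z_k, \lambda^*) &= f(z_k) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* c_i(z_k) \\ &\le f(z_k) - \lambda_j^* c_j(z_k) \\ &\le f(z_k) - \|z_k - x^*\| \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). \end{aligned} \tag{12.71} $$

From the Taylor series estimate (12.63), we have meanwhile that

$$ \mathcal{L}(z_k, \lambda^*) = f(x^*) + O(\|z_k - x^*\|^2), $$

and by combining with (12.71), we obtain

$$ f(z_k) \ge f(x^*) + \|z_k - x^*\| \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). $$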
Because of (12.70), this inequality is incompatible with (12.67). We conclude that
$d \in \mathcal{C}(x^*, \lambda^*)$, and hence $d^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) d \ge \sigma$.

By combining the Taylor series estimate (12.63) with (12.69) and using (12.68), we
obtain

$$ \begin{aligned} f(z_k) &\ge f(x^*) + \tfrac{1}{2} (z_k - x^*)^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) (z_k - x^*) + o(\|z_k - x^*\|^2) \\ &= f(x^*) + \tfrac{1}{2} d^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) d \, \|z_k - x^*\|^2 + o(\|z_k - x^*\|^2) \\ &\ge f(x^*) + (\sigma/2) \|z_k - x^*\|^2 + o(\|z_k - x^*\|^2). \end{aligned} $$

This inequality yields the contradiction to (12.67). We conclude that every feasible sequence
$\{z_k\}$ approaching $x^*$ must satisfy $f(z_k) \ge f(x^*) + (\sigma/4) \|z_k - x^*\|^2$, for all $k$ sufficiently
large, so $x^*$ is a strict local solution.  □


□  Example 12.8 (Example 12.2, One More Time)

We now return to Example 12.2 to check the second-order conditions for problem
(12.18). In this problem we have $f(x) = x_1 + x_2$, $c_1(x) = 2 - x_1^2 - x_2^2$, $\mathcal{E} = \emptyset$, and $\mathcal{I} = \{1\}$.
The Lagrangian is

$$ \mathcal{L}(x, \lambda) = (x_1 + x_2) - \lambda_1 (2 - x_1^2 - x_2^2), $$

and it is easy to show that the KKT conditions (12.34) are satisfied by $x^* = (-1, -1)^T$, with
$\lambda_1^* = \tfrac{1}{2}$. The Lagrangian Hessian at this point is

$$ \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) = \begin{bmatrix} 2\lambda_1^* & 0 \\ 0 & 2\lambda_1^* \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. $$

This matrix is positive definite, so it certainly satisfies the conditions of Theorem 12.6. We
conclude that $x^* = (-1, -1)^T$ is a strict local solution for (12.18). (In fact, it is the global
solution of this problem, since, as we note later, this problem is a convex programming
problem.)

□
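As a quick sanity check on Example 12.8, the sketch below (plain Python, added for illustration) evaluates the stationarity residual, the constraint value, and the eigenvalues of the Lagrangian Hessian at $x^* = (-1, -1)^T$ with $\lambda_1^* = 1/2$. The helper `sym2x2_eigenvalues` is a hypothetical name; it uses the closed-form eigenvalues of a symmetric 2-by-2 matrix.

```python
import math

def lagrangian_hessian(lam1):
    # f is linear and c1(x) = 2 - x1^2 - x2^2 has Hessian -2*I,
    # so the Hessian of L = f - lam1 * c1 is 2 * lam1 * I.
    return [[2.0 * lam1, 0.0], [0.0, 2.0 * lam1]]

def sym2x2_eigenvalues(h):
    """Eigenvalues (smallest first) of a symmetric 2-by-2 matrix."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return (mean - radius, mean + radius)

x_star = (-1.0, -1.0)
lam1_star = 0.5

grad_f = (1.0, 1.0)
grad_c1 = (-2.0 * x_star[0], -2.0 * x_star[1])

# Stationarity (12.34a) and activity of the constraint at x*.
stationarity = tuple(gf - lam1_star * gc for gf, gc in zip(grad_f, grad_c1))
c1_value = 2.0 - x_star[0] ** 2 - x_star[1] ** 2

eigs = sym2x2_eigenvalues(lagrangian_hessian(lam1_star))
is_positive_definite = min(eigs) > 0.0
```

Because the whole Hessian is positive definite here, the curvature condition (12.65) holds on every nonzero direction, not only on the critical cone.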


□  Example 12.9

For a more complex example, consider the problem

$$ \min \ -0.1 (x_1 - 4)^2 + x_2^2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 1 \ge 0, \tag{12.72} $$
in which we seek to minimize a nonconvex function over the exterior of the unit circle.
Obviously, the objective function is not bounded below on the feasible region, since we can
take the feasible sequence

$$ \begin{bmatrix} 10 \\ 0 \end{bmatrix}, \ \begin{bmatrix} 20 \\ 0 \end{bmatrix}, \ \begin{bmatrix} 30 \\ 0 \end{bmatrix}, \ \begin{bmatrix} 40 \\ 0 \end{bmatrix}, \ \ldots $$

and note that $f(x)$ approaches $-\infty$ along this sequence. Therefore, no global solution
exists, but it may still be possible to identify a strict local solution on the boundary of the
constraint. We search for such a solution by using the KKT conditions (12.34) and the
second-order conditions of Theorem 12.6.

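By defining the Lagrangian for (12.72) in the usual way, it is easy to verify that

$$ \nabla_x \mathcal{L}(x, \lambda) = \begin{bmatrix} -0.2 (x_1 - 4) - 2 \lambda_1 x_1 \\ 2 x_2 - 2 \lambda_1 x_2 \end{bmatrix}, \tag{12.73a} $$

$$ \nabla^2_{xx} \mathcal{L}(x, \lambda) = \begin{bmatrix} -0.2 - 2 \lambda_1 & 0 \\ 0 & 2 - 2 \lambda_1 \end{bmatrix}. \tag{12.73b} $$

The point $x^* = (1, 0)^T$ satisfies the KKT conditions with $\lambda_1^* = 0.3$ and the active set
$\mathcal{A}(x^*) = \{1\}$. To check that the second-order sufficient conditions are satisfied at this point,
we note that

$$ \nabla c_1(x^*) = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, $$

so that the set $\mathcal{C}$ defined in (12.53) is simply

$$ \mathcal{C}(x^*, \lambda^*) = \{ (0, w_2)^T \mid w_2 \in \mathbb{R} \}. $$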

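Now, by substituting $x^*$ and $\lambda^*$ into (12.73b), we have for any $w \in \mathcal{C}(x^*, \lambda^*)$ with $w \ne 0$
that $w_2 \ne 0$ and thus

$$ w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w = \begin{bmatrix} 0 \\ w_2 \end{bmatrix}^T \begin{bmatrix} -0.8 & 0 \\ 0 & 1.4 \end{bmatrix} \begin{bmatrix} 0 \\ w_2 \end{bmatrix} = 1.4 w_2^2 > 0. $$

(The first diagonal entry is $-0.2 - 2\lambda_1^* = -0.8$ by (12.73b); it plays no role here, because $w_1 = 0$ on the critical cone.)
Hence, the second-order sufficient conditions are satisfied, and we conclude from
Theorem 12.6 that $(1, 0)^T$ is a strict local solution for (12.72).

□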
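The computations in Example 12.9 are easy to reproduce. The sketch below (plain Python, added for illustration; function names are hypothetical) evaluates (12.73a) and (12.73b) at $x^* = (1, 0)^T$, $\lambda_1^* = 0.3$, compares the curvature along a critical direction with the curvature along a noncritical one, and evaluates $f$ along the unbounded feasible sequence.

```python
def f(x):
    return -0.1 * (x[0] - 4.0) ** 2 + x[1] ** 2

def c1(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0

def grad_lagrangian(x, lam1):
    # Gradient of L(x, lam) = f(x) - lam1 * c1(x), as in (12.73a).
    return (-0.2 * (x[0] - 4.0) - 2.0 * lam1 * x[0],
            2.0 * x[1] - 2.0 * lam1 * x[1])

def hess_lagrangian(lam1):
    # Hessian of L with respect to x, as in (12.73b); it does not depend on x.
    return ((-0.2 - 2.0 * lam1, 0.0), (0.0, 2.0 - 2.0 * lam1))

def curvature(w, lam1):
    """Quadratic form w^T H w with H the Lagrangian Hessian."""
    h = hess_lagrangian(lam1)
    return sum(w[i] * h[i][j] * w[j] for i in range(2) for j in range(2))

x_star = (1.0, 0.0)
lam1_star = 0.3

kkt_residual = grad_lagrangian(x_star, lam1_star)
critical_curvature = curvature((0.0, 1.0), lam1_star)     # direction in C(x*, lam*)
noncritical_curvature = curvature((1.0, 0.0), lam1_star)  # direction outside C(x*, lam*)

# f decreases without bound along the feasible points (10k, 0), k = 1, 2, ...
values_along_sequence = [f((10.0 * k, 0.0)) for k in range(1, 5)]
```

The negative curvature along $(1, 0)^T$ shows why the restriction to the critical cone matters: the Lagrangian Hessian is indefinite, yet the sufficient condition of Theorem 12.6 still holds.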
SECOND-ORDER  CONDITIONS  AND  PROJECTED  HESSIANS

The second-order conditions are sometimes stated in a form that is slightly weaker
but easier to verify than (12.57) and (12.65). This form uses a two-sided projection of the
Lagrangian Hessian $\nabla^2_{xx} \mathcal{L}(x^*, \lambda^*)$ onto subspaces that are related to $\mathcal{C}(x^*, \lambda^*)$.

The simplest case is obtained when the multiplier $\lambda^*$ that satisfies the KKT conditions
(12.34) is unique (as happens, for example, when the LICQ condition holds) and strict
complementarity holds. In this case, the definition (12.53) of $\mathcal{C}(x^*, \lambda^*)$ reduces to

$$ \mathcal{C}(x^*, \lambda^*) = \text{Null} \left[ \nabla c_i(x^*)^T \right]_{i \in \mathcal{A}(x^*)} = \text{Null} \, A(x^*), $$

where $A(x^*)$ is defined as in (12.37). In other words, $\mathcal{C}(x^*, \lambda^*)$ is the null space of the matrix
whose rows are the active constraint gradients at $x^*$. As in (12.39), we can define the matrix
$Z$ with full column rank whose columns span the space $\mathcal{C}(x^*, \lambda^*)$; that is,

$$ \mathcal{C}(x^*, \lambda^*) = \{ Z u \mid u \in \mathbb{R}^{n - |\mathcal{A}(x^*)|} \}. $$

Hence, the condition (12.57) in Theorem 12.5 can be restated as

$$ u^T Z^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) Z u \ge 0 \quad \text{for all } u, $$

or, more succinctly,

$$ Z^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) Z \ \text{is positive semidefinite.} $$

Similarly, the condition (12.65) in Theorem 12.6 can be restated as

$$ Z^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) Z \ \text{is positive definite.} $$

As we show next, $Z$ can be computed numerically, so that the positive (semi)definiteness
conditions can actually be checked by forming these matrices and finding their eigenvalues.

One way to compute the matrix $Z$ is to apply a QR factorization to the matrix of
active constraint gradients whose null space we seek. In the simplest case above (in which
the multiplier $\lambda^*$ is unique and strict complementarity holds), we define $A(x^*)$ as in (12.37)
and write the QR factorization of its transpose as

$$ A(x^*)^T = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \tag{12.74} $$

where $R$ is a square upper triangular matrix and $Q$ is $n \times n$ orthogonal. If $R$ is nonsingular,
we can set $Z = Q_2$. If $R$ is singular (indicating that the active constraint gradients are
linearly dependent), a slight enhancement of this procedure that makes use of column
pivoting during the QR procedure can be used to identify $Z$.
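The QR-based recipe can be sketched in a few lines. The code below is an illustrative plain-Python implementation, not a production routine: it assumes the active constraint gradients are linearly independent (so no column pivoting is needed), uses a hand-written Householder factorization, and all function names are hypothetical. It is applied to the data of Example 12.9, where $A(x^*) = [\,2 \ \ 0\,]$.

```python
import math

def householder_qr(a):
    """Full QR of an n-by-m matrix (n >= m), given as a list of rows.

    Returns (q, r) with q n-by-n orthogonal and r n-by-m upper triangular.
    """
    n, m = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(m):
        # Householder vector v that zeroes the entries of column k below row k.
        x = [r[i][k] for i in range(k, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(t * t for t in v)
        # r <- H r, where H = I - 2 v v^T / (v^T v) acts on rows k..n-1.
        for j in range(m):
            s = 2.0 * sum(v[i] * r[k + i][j] for i in range(len(v))) / vtv
            for i in range(len(v)):
                r[k + i][j] -= s * v[i]
        # q <- q H, so that q * r reproduces the original matrix.
        for i in range(n):
            s = 2.0 * sum(q[i][k + j] * v[j] for j in range(len(v))) / vtv
            for j in range(len(v)):
                q[i][k + j] -= s * v[j]
    return q, r

def null_space_basis(active_gradients):
    """Z = Q2 from (12.74); rows of the input are the active gradients."""
    m, n = len(active_gradients), len(active_gradients[0])
    a_t = [[active_gradients[j][i] for j in range(m)] for i in range(n)]
    q, _ = householder_qr(a_t)
    return [row[m:] for row in q]  # last n - m columns of Q

def projected_hessian(h, z):
    """Form Z^T H Z for a square matrix h and an n-by-p matrix z."""
    n, p = len(z), len(z[0])
    hz = [[sum(h[i][k] * z[k][j] for k in range(n)) for j in range(p)]
          for i in range(n)]
    return [[sum(z[k][i] * hz[k][j] for k in range(n)) for j in range(p)]
            for i in range(p)]

# Example 12.9: one active gradient (2, 0) and Hessian (12.73b) at lambda_1 = 0.3.
z = null_space_basis([[2.0, 0.0]])
hessian = [[-0.2 - 2.0 * 0.3, 0.0], [0.0, 2.0 - 2.0 * 0.3]]
reduced = projected_hessian(hessian, z)
```

Here `reduced` is the 1-by-1 matrix $[1.4]$, which is positive definite, in agreement with the direct calculation in Example 12.9. For larger problems a library QR routine with column pivoting would normally be preferred.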



12.6  OTHER  CONSTRAINT  QUALIFICATIONS

We now reconsider constraint qualifications, the conditions discussed in Sections 12.2 and
12.4 that ensure that the linearized approximation to the feasible set $\Omega$ captures the essential
shape of $\Omega$ in a neighborhood of $x^*$.

One situation in which the linearized feasible direction set $\mathcal{F}(x^*)$ is obviously an
adequate representation of the actual feasible set occurs when all the active constraints are
already linear; that is,

$$ c_i(x) = a_i^T x + b_i, \tag{12.75} $$

for some $a_i \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$. It is not difficult to prove a version of Lemma 12.2 for this
situation.

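Lemma 12.7.

Suppose that at some $x^* \in \Omega$, all active constraints $c_i(\cdot)$, $i \in \mathcal{A}(x^*)$, are linear functions.
Then $\mathcal{F}(x^*) = T_\Omega(x^*)$.

PROOF. We have from Lemma 12.2 (i) that $T_\Omega(x^*) \subset \mathcal{F}(x^*)$. To prove that $\mathcal{F}(x^*) \subset T_\Omega(x^*)$,
we choose an arbitrary $w \in \mathcal{F}(x^*)$ and show that $w \in T_\Omega(x^*)$. By Definition 12.3 and the
form (12.75) of the constraints, we have

$$ \begin{aligned} a_i^T w &= 0, \quad \text{for all } i \in \mathcal{E}, \\ a_i^T w &\ge 0, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}. \end{aligned} $$

First, note that there is a positive scalar $\bar{t}$ such that the inactive constraints remain
inactive at $x^* + t w$, for all $t \in [0, \bar{t}]$, that is,

$$ c_i(x^* + t w) > 0, \quad \text{for all } i \in \mathcal{I} \setminus \mathcal{A}(x^*) \text{ and all } t \in [0, \bar{t}]. $$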

Now define the sequence $z_k$ by

$$ z_k = x^* + (\bar{t}/k) w, \quad k = 1, 2, \ldots. $$

Since $a_i^T w \ge 0$ for all $i \in \mathcal{I} \cap \mathcal{A}(x^*)$, we have

$$ c_i(z_k) = c_i(z_k) - c_i(x^*) = a_i^T (z_k - x^*) = \frac{\bar{t}}{k} a_i^T w \ge 0, \quad \text{for all } i \in \mathcal{I} \cap \mathcal{A}(x^*), $$

so that $z_k$ is feasible with respect to the active inequality constraints $c_i$, $i \in \mathcal{I} \cap \mathcal{A}(x^*)$. By the
choice of $\bar{t}$, we find that $z_k$ is also feasible with respect to the inactive inequality constraints
$i \in \mathcal{I} \setminus \mathcal{A}(x^*)$, and it is easy to show that $c_i(z_k) = 0$ for the equality constraints $i \in \mathcal{E}$. Hence,
$z_k$ is feasible for each $k = 1, 2, \ldots$. In addition, we have that

$$ \frac{z_k - x^*}{(\bar{t}/k)} = \frac{(\bar{t}/k) w}{(\bar{t}/k)} = w, $$

so that indeed $w$ is the limiting direction of $\{z_k\}$. Hence, $w \in T_\Omega(x^*)$, and the proof is
complete.  □

We  conclude  from  this  result  that  the  condition  that  all  active  constraints  be  linear  is 
another  possible  constraint  qualification.  It  is  neither  weaker  nor  stronger  than  the  LICQ 
condition,  that  is,  there  are  situations  in  which  one  condition  is  satisfied  but  not  the  other 
(see  Exercise  12.12). 

Another  useful  generalization  of  the  LICQ  is  the  Mangasarian-Fromovitz  constraint 
qualification  (MFCQ). 

Definition 12.6 (MFCQ).

We say that the Mangasarian-Fromovitz constraint qualification (MFCQ) holds if there
exists a vector $w \in \mathbb{R}^n$ such that

$$ \begin{aligned} \nabla c_i(x^*)^T w &> 0, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \\ \nabla c_i(x^*)^T w &= 0, \quad \text{for all } i \in \mathcal{E}, \end{aligned} $$

and the set of equality constraint gradients $\{ \nabla c_i(x^*), \ i \in \mathcal{E} \}$ is linearly independent.

Note the strict inequality involving the active inequality constraints.

The MFCQ is a weaker condition than LICQ. If LICQ is satisfied, then the system of
equalities defined by

$$ \begin{aligned} \nabla c_i(x^*)^T w &= 1, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \\ \nabla c_i(x^*)^T w &= 0, \quad \text{for all } i \in \mathcal{E}, \end{aligned} $$

has a solution $w$, by full rank of the active constraint gradients. Hence, we can choose the
$w$ of Definition 12.6 to be precisely this vector. On the other hand, it is easy to construct
examples in which the MFCQ is satisfied but the LICQ is not; see Exercise 12.13.
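To make the gap between the two conditions concrete, here is a small hypothetical instance (it is not taken from the text or from Exercise 12.13): three linear inequality constraints $x_1 \ge 0$, $x_2 \ge 0$, $x_1 + x_2 \ge 0$, all active at $x^* = (0, 0)^T$. The plain-Python sketch below checks that the three active gradients cannot be linearly independent in $\mathbb{R}^2$, while $w = (1, 1)^T$ is a witness for the MFCQ. The `rank` helper is an illustrative Gaussian elimination, not a robust numerical rank test.

```python
def rank(rows, tol=1e-12):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    rk = 0
    for col in range(len(m[0])):
        if rk == len(m):
            break
        pivot = max(range(rk, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) <= tol:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

# Gradients of the active constraints x1 >= 0, x2 >= 0, x1 + x2 >= 0 at x* = (0, 0).
active_gradients = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# LICQ requires the active gradients to be linearly independent.
licq_holds = rank(active_gradients) == len(active_gradients)

# MFCQ (no equality constraints here): a direction w with grad c_i^T w > 0 for all active i.
w = (1.0, 1.0)
mfcq_holds = all(g[0] * w[0] + g[1] * w[1] > 0.0 for g in active_gradients)
```

In this instance `licq_holds` is false and `mfcq_holds` is true.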

It is possible to prove a version of the first-order necessary condition result (Theorem 12.1)
in which MFCQ replaces LICQ in the assumptions. MFCQ gives rise to the nice
property that it is equivalent to boundedness of the set of Lagrange multiplier vectors $\lambda^*$ for
which the KKT conditions (12.34) are satisfied. (In the case of LICQ, this set consists of a
unique vector $\lambda^*$, and so is trivially bounded.)

Note that constraint qualifications are sufficient conditions for the linear approximation
to be adequate, not necessary conditions. For instance, consider the set defined by $x_2 \ge -x_1^2$
and $x_2 \le x_1^2$ and the feasible point $x^* = (0, 0)^T$. None of the constraint qualifications
we have discussed are satisfied, but the linear approximation $\mathcal{F}(x^*) = \{ (w_1, 0)^T \mid w_1 \in \mathbb{R} \}$
accurately reflects the geometry of the feasible set near $x^*$.


12.7  A  GEOMETRIC  VIEWPOINT


Finally, we mention an alternative first-order optimality condition that depends only on the
geometry of the feasible set $\Omega$ and not on its particular algebraic description in terms of the
constraint functions $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$. In geometric terms, our problem (12.1) can be stated as

$$ \min f(x) \quad \text{subject to} \quad x \in \Omega, \tag{12.76} $$

where $\Omega$ is the feasible set.

To prove a “geometric” first-order condition, we need to define the normal cone to
the set $\Omega$ at a feasible point $x$.

Definition 12.7.

The normal cone to the set $\Omega$ at the point $x \in \Omega$ is defined as

$$ N_\Omega(x) = \{ v \mid v^T w \le 0 \ \text{for all } w \in T_\Omega(x) \}, \tag{12.77} $$

where $T_\Omega(x)$ is the tangent cone of Definition 12.2. Each vector $v \in N_\Omega(x)$ is said to be a
normal vector.

Geometrically, each normal vector $v$ makes an angle of at least $\pi/2$ with every tangent
vector.

The first-order necessary condition for (12.76) is delightfully simple.

Theorem 12.8.

Suppose that $x^*$ is a local minimizer of $f$ in $\Omega$. Then

$$ -\nabla f(x^*) \in N_\Omega(x^*). \tag{12.78} $$

PROOF. Given any $d \in T_\Omega(x^*)$, we have for the sequences $\{t_k\}$ and $\{z_k\}$ in Definition 12.2
that

$$ z_k \in \Omega, \quad z_k = x^* + t_k d + o(t_k), \quad \text{for all } k. \tag{12.79} $$

Since $x^*$ is a local solution, we must have

$$ f(z_k) \ge f(x^*) $$

for all $k$ sufficiently large. Hence, since $f$ is continuously differentiable, we have from Taylor’s
theorem (2.4) that

$$ f(z_k) - f(x^*) = t_k \nabla f(x^*)^T d + o(t_k) \ge 0. $$

By dividing by $t_k$ and taking limits as $k \to \infty$, we have

$$ \nabla f(x^*)^T d \ge 0. $$

Recall that $d$ was an arbitrary member of $T_\Omega(x^*)$, so we have $-\nabla f(x^*)^T d \le 0$ for all
$d \in T_\Omega(x^*)$. We conclude from Definition 12.7 that $-\nabla f(x^*) \in N_\Omega(x^*)$.  □

This result suggests a close relationship between $N_\Omega(x^*)$ and the conic combination
of active constraint gradients given by (12.50). When the linear independence constraint
qualification holds, they are identical (to within a change of sign).

Lemma 12.9.

Suppose that the LICQ assumption (Definition 12.4) holds at $x^*$. Then the normal cone
$N_\Omega(x^*)$ is simply $-N$, where $N$ is the set defined in (12.50).

PROOF. The proof follows from Farkas’ Lemma (Lemma 12.4) and Definition 12.7 of
$N_\Omega(x^*)$. From Lemma 12.4, we have that

$$ g \in N \iff g^T d \ge 0 \ \text{for all } d \in \mathcal{F}(x^*). $$

Since we have $\mathcal{F}(x^*) = T_\Omega(x^*)$ from Lemma 12.2, it follows by switching the sign of this
expression that

$$ g \in -N \iff g^T d \le 0 \ \text{for all } d \in T_\Omega(x^*). $$

We conclude from Definition 12.7 that $N_\Omega(x^*) = -N$, as claimed.  □


12.8  LAGRANGE  MULTIPLIERS  AND  SENSITIVITY


The importance of Lagrange multipliers in optimality theory should be clear, but what of
their intuitive significance? We show in this section that each Lagrange multiplier $\lambda_i^*$ tells us
something about the sensitivity of the optimal objective value $f(x^*)$ to the presence of the
constraint $c_i$. To put it another way, $\lambda_i^*$ indicates how hard $f$ is “pushing” or “pulling” the
solution $x^*$ against the particular constraint $c_i$.

We illustrate this point with some informal analysis. When we choose an inactive
constraint $i \notin \mathcal{A}(x^*)$ such that $c_i(x^*) > 0$, the solution $x^*$ and function value $f(x^*)$ are
indifferent to whether this constraint is present or not. If we perturb $c_i$ by a tiny amount, it
will still be inactive and $x^*$ will still be a local solution of the optimization problem. Since
$\lambda_i^* = 0$ from (12.34e), the Lagrange multiplier indicates accurately that constraint $i$ is not
significant.

Suppose instead that constraint $i$ is active, and let us perturb the right-hand side of this
constraint a little, requiring, say, that $c_i(x) \ge -\epsilon \|\nabla c_i(x^*)\|$ instead of $c_i(x) \ge 0$. Suppose
that $\epsilon$ is sufficiently small that the perturbed solution $x^*(\epsilon)$ still has the same set of active
constraints, and that the Lagrange multipliers are not much affected by the perturbation.
(These conditions can be made more rigorous with the help of strict complementarity and
second-order conditions.) We then find that

$$ \begin{aligned} -\epsilon \|\nabla c_i(x^*)\| &= c_i(x^*(\epsilon)) - c_i(x^*) \approx (x^*(\epsilon) - x^*)^T \nabla c_i(x^*), \\ 0 &= c_j(x^*(\epsilon)) - c_j(x^*) \approx (x^*(\epsilon) - x^*)^T \nabla c_j(x^*), \quad \text{for all } j \in \mathcal{A}(x^*) \text{ with } j \ne i. \end{aligned} $$


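The value of $f(x^*(\epsilon))$, meanwhile, can be estimated with the help of (12.34a). We have

$$ \begin{aligned} f(x^*(\epsilon)) - f(x^*) &\approx (x^*(\epsilon) - x^*)^T \nabla f(x^*) \\ &= \sum_{j \in \mathcal{A}(x^*)} \lambda_j^* (x^*(\epsilon) - x^*)^T \nabla c_j(x^*) \\ &\approx -\epsilon \|\nabla c_i(x^*)\| \lambda_i^*. \end{aligned} $$

By taking limits, we see that the family of solutions $x^*(\epsilon)$ satisfies

$$ \frac{d f(x^*(\epsilon))}{d \epsilon} = -\lambda_i^* \|\nabla c_i(x^*)\|. \tag{12.80} $$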


A sensitivity analysis of this problem would conclude that if $\lambda_i^* \|\nabla c_i(x^*)\|$ is large, then the
optimal value is sensitive to the placement of the $i$th constraint, while if this quantity is
small, the dependence is not too strong. If $\lambda_i^*$ is exactly zero for some active constraint, small
perturbations to $c_i$ in some directions will hardly affect the optimal objective value at all;
the change is zero, to first order.
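Formula (12.80) can be checked on the problem of Example 12.8, where everything is available in closed form. With the perturbed constraint $2 - x_1^2 - x_2^2 \ge -\epsilon \|\nabla c_1(x^*)\|$, the feasible set is a disk of radius $\sqrt{2 + \epsilon \|\nabla c_1(x^*)\|}$, and the linear objective $x_1 + x_2$ is minimized at the boundary point in the direction $-(1, 1)^T$. The plain-Python sketch below (an illustration under these assumptions; the function names are hypothetical) compares a finite-difference estimate of $d f(x^*(\epsilon)) / d\epsilon$ at $\epsilon = 0$ with $-\lambda_1^* \|\nabla c_1(x^*)\|$.

```python
import math

LAMBDA_STAR = 0.5                   # multiplier from Example 12.8
GRAD_C_NORM = math.hypot(2.0, 2.0)  # ||grad c1(x*)|| at x* = (-1, -1)

def optimal_value(eps):
    """Optimal value of min x1 + x2 s.t. 2 - x1^2 - x2^2 >= -eps * ||grad c1(x*)||."""
    radius = math.sqrt(2.0 + eps * GRAD_C_NORM)
    # The minimizer over the disk is x = -(radius / sqrt(2)) * (1, 1).
    return -math.sqrt(2.0) * radius

def central_difference(func, h=1e-6):
    """Central finite-difference estimate of func'(0)."""
    return (func(h) - func(-h)) / (2.0 * h)

predicted = -LAMBDA_STAR * GRAD_C_NORM         # right-hand side of (12.80)
estimated = central_difference(optimal_value)  # left-hand side, numerically
```

Both quantities are approximately $-\sqrt{2}$: relaxing the constraint (positive $\epsilon$) lowers the optimal value at the rate predicted by the multiplier.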

This  discussion  motivates  the  definition  below,  which  classifies  constraints  according 
to  whether  or  not  their  corresponding  Lagrange  multiplier  is  zero. 

Definition 12.8.

Let $x^*$ be a solution of the problem (12.1), and suppose that the KKT conditions (12.34)
are satisfied. We say that an inequality constraint $c_i$ is strongly active or binding if $i \in \mathcal{A}(x^*)$
and $\lambda_i^* > 0$ for some Lagrange multiplier $\lambda^*$ satisfying (12.34). We say that $c_i$ is weakly active
if $i \in \mathcal{A}(x^*)$ and $\lambda_i^* = 0$ for all $\lambda^*$ satisfying (12.34).

Note that the analysis above is independent of scaling of the individual constraints.
For instance, we might change the formulation of the problem by replacing some active
constraint $c_i$ by $10 c_i$. The new problem will actually be equivalent (that is, it has the same
feasible set and same solution), but the optimal multiplier $\lambda_i^*$ corresponding to $c_i$ will
be replaced by $\lambda_i^*/10$. However, since $\|\nabla c_i(x^*)\|$ is replaced by $10 \|\nabla c_i(x^*)\|$, the product
$\lambda_i^* \|\nabla c_i(x^*)\|$ does not change. If, on the other hand, we replace the objective function $f$ by
$10 f$, the multipliers $\lambda_i^*$ in (12.34) all will need to be replaced by $10 \lambda_i^*$. Hence in (12.80) we
see that the sensitivity of $f$ to perturbations has increased by a factor of 10, which is exactly
what we would expect.


12.9  DUALITY 


In this section we present some elements of the duality theory for nonlinear programming.
This theory is used to motivate and develop some important algorithms, including
the augmented Lagrangian algorithms of Chapter 17. In its full generality, duality theory
ranges beyond nonlinear programming to provide important insight into the fields of convex
nonsmooth optimization and even discrete optimization. Its specialization to linear
programming proved central to the development of that area; see Chapter 13. (We note that
the discussion of linear programming duality in Section 13.1 can be read without consulting
this section first.)

Duality  theory  shows  how  we  can  construct  an  alternative  problem  from  the  functions 
and  data  that  define  the  original  optimization  problem.  This  alternative  “dual”  problem  is 
related  to  the  original  problem  (which  is  sometimes  referred  to  in  this  context  as  the  “primal” 
for  purposes  of  contrast)  in  fascinating  ways.  In  some  cases,  the  dual  problem  is  easier  to 
solve  computationally  than  the  original  problem.  In  other  cases,  the  dual  can  be  used  to 
obtain  easily  a  lower  bound  on  the  optimal  value  of  the  objective  for  the  primal  problem. 
As  remarked  above,  the  dual  has  also  been  used  to  design  algorithms  for  solving  the  primal 
problem. 

Our results in this section are mostly restricted to the special case of (12.1) in which
there are no equality constraints and the objective $f$ and the negatives of the inequality
constraints $-c_i$ are all convex functions. For simplicity we assume that there are $m$ inequality
constraints labelled $1, 2, \ldots, m$ and rewrite (12.1) as follows:

$$ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c_i(x) \ge 0, \quad i = 1, 2, \ldots, m. $$

If we assemble the constraints into a vector function

$$ c(x) = (c_1(x), c_2(x), \ldots, c_m(x))^T, $$

we can write the problem as

$$ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c(x) \ge 0, \tag{12.81} $$
for which the Lagrangian function (12.16) with Lagrange multiplier vector $\lambda \in \mathbb{R}^m$ is

$$ \mathcal{L}(x, \lambda) = f(x) - \lambda^T c(x). $$

We define the dual objective function $q : \mathbb{R}^m \to \mathbb{R}$ as follows:

$$ q(\lambda) \stackrel{\text{def}}{=} \inf_x \mathcal{L}(x, \lambda). \tag{12.82} $$

In many problems, this infimum is $-\infty$ for some values of $\lambda$. We define the domain of $q$ as
the set of $\lambda$ values for which $q$ is finite, that is,

$$ \mathcal{D} = \{ \lambda \mid q(\lambda) > -\infty \}. \tag{12.83} $$

Note that calculation of the infimum in (12.82) requires finding the global minimizer of the
function $\mathcal{L}(\cdot, \lambda)$ for the given $\lambda$ which, as we have noted in Chapter 2, may be extremely
difficult in practice. However, when $f$ and $-c_i$ are convex functions and $\lambda \ge 0$ (the case in
which we are most interested), the function $\mathcal{L}(\cdot, \lambda)$ is also convex. In this situation, all local
minimizers are global minimizers (as we verify in Exercise 12.4), so computation of $q(\lambda)$
becomes a more practical proposition.

The dual problem to (12.81) is defined as follows:

$$ \max_{\lambda \in \mathbb{R}^m} q(\lambda) \quad \text{subject to} \quad \lambda \ge 0. \tag{12.84} $$


□  Example 12.10

Consider the problem

$$ \min_{(x_1, x_2)} 0.5 (x_1^2 + x_2^2) \quad \text{subject to} \quad x_1 - 1 \ge 0. \tag{12.85} $$

The Lagrangian is

$$ \mathcal{L}(x_1, x_2, \lambda_1) = 0.5 (x_1^2 + x_2^2) - \lambda_1 (x_1 - 1). $$

If we hold $\lambda_1$ fixed, this is a convex function of $(x_1, x_2)^T$. Therefore, the infimum with
respect to $(x_1, x_2)^T$ is achieved when the partial derivatives with respect to $x_1$ and $x_2$ are
zero, that is,

$$ x_1 - \lambda_1 = 0, \quad x_2 = 0. $$

By substituting these infimal values into $\mathcal{L}(x_1, x_2, \lambda_1)$ we obtain the dual objective (12.82):

$$ q(\lambda_1) = 0.5 (\lambda_1^2 + 0) - \lambda_1 (\lambda_1 - 1) = -0.5 \lambda_1^2 + \lambda_1. $$

Hence, the dual problem (12.84) is

$$ \max_{\lambda_1 \ge 0} \ -0.5 \lambda_1^2 + \lambda_1, \tag{12.86} $$

which clearly has the solution $\lambda_1 = 1$.

□
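The dual objective of Example 12.10 can be evaluated directly from its definition. The sketch below (plain Python, added for illustration) uses the fact derived above that the infimum of the Lagrangian over $x$ is attained at $x_1 = \lambda_1$, $x_2 = 0$, and then locates the maximizer of $q$ on a coarse grid of nonnegative multipliers. The grid search is only a stand-in for solving (12.86) analytically.

```python
def lagrangian(x1, x2, lam1):
    return 0.5 * (x1 ** 2 + x2 ** 2) - lam1 * (x1 - 1.0)

def dual_objective(lam1):
    # For fixed lam1 the Lagrangian is convex in x and is minimized at
    # x1 = lam1, x2 = 0 (set the partial derivatives to zero).
    return lagrangian(lam1, 0.0, lam1)

# Coarse grid over lam1 in [0, 3]; the dual problem requires lam1 >= 0.
grid = [k / 100.0 for k in range(301)]
best_multiplier = max(grid, key=dual_objective)
best_dual_value = dual_objective(best_multiplier)

# Primal optimum for comparison: x = (1, 0) with objective value 0.5.
primal_value = 0.5 * (1.0 ** 2 + 0.0 ** 2)
```

The grid maximizer is $\lambda_1 = 1$ with dual value 0.5, equal to the primal optimal value, so there is no duality gap in this example.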


In the remainder of this section, we show how the dual problem is related to (12.81).
Our first result concerns concavity of $q$.

Theorem 12.10.

The function $q$ defined by (12.82) is concave and its domain $\mathcal{D}$ is convex.

PROOF. For any $\lambda^0$ and $\lambda^1$ in $\mathbb{R}^m$, any $x \in \mathbb{R}^n$, and any $\alpha \in [0, 1]$, we have

$$ \mathcal{L}(x, (1 - \alpha) \lambda^0 + \alpha \lambda^1) = (1 - \alpha) \mathcal{L}(x, \lambda^0) + \alpha \mathcal{L}(x, \lambda^1). $$

By taking the infimum of both sides in this expression, using the definition (12.82), and
using the result that the infimum of a sum is greater than or equal to the sum of infimums,
we obtain

$$ q((1 - \alpha) \lambda^0 + \alpha \lambda^1) \ge (1 - \alpha) q(\lambda^0) + \alpha q(\lambda^1), $$

confirming concavity of $q$. If both $\lambda^0$ and $\lambda^1$ belong to $\mathcal{D}$, this inequality implies that
$q((1 - \alpha) \lambda^0 + \alpha \lambda^1) > -\infty$ also, and therefore $(1 - \alpha) \lambda^0 + \alpha \lambda^1 \in \mathcal{D}$, verifying convexity
of $\mathcal{D}$.  □

The optimal value of the dual problem (12.84) gives a lower bound on the optimal
objective value for the primal problem (12.81). This observation is a consequence of the
following weak duality result.

Theorem 12.11 (Weak Duality).

For any $\bar{x}$ feasible for (12.81) and any $\bar{\lambda} \ge 0$, we have $q(\bar{\lambda}) \le f(\bar{x})$.

PROOF.

$$ q(\bar{\lambda}) = \inf_x \left[ f(x) - \bar{\lambda}^T c(x) \right] \le f(\bar{x}) - \bar{\lambda}^T c(\bar{x}) \le f(\bar{x}), $$

where the final inequality follows from $\bar{\lambda} \ge 0$ and $c(\bar{x}) \ge 0$.

□
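Weak duality is easy to observe numerically on the problem (12.85). The plain-Python sketch below (added for illustration) uses the closed-form dual objective $q(\lambda_1) = -0.5 \lambda_1^2 + \lambda_1$ from Example 12.10 and compares it with the primal objective on an arbitrary grid of feasible points and nonnegative multipliers; the grids are assumptions chosen only for the demonstration.

```python
def primal_objective(x1, x2):
    return 0.5 * (x1 ** 2 + x2 ** 2)

def dual_objective(lam1):
    # Closed form of q for problem (12.85), as derived in Example 12.10.
    return -0.5 * lam1 ** 2 + lam1

# Feasible points satisfy x1 - 1 >= 0; multipliers satisfy lam1 >= 0.
feasible_points = [(1.0 + 0.25 * i, -1.0 + 0.5 * j)
                   for i in range(9) for j in range(5)]
multipliers = [0.1 * k for k in range(31)]

# Weak duality says every gap f(x) - q(lam) is nonnegative.
smallest_gap = min(primal_objective(x1, x2) - dual_objective(lam1)
                   for (x1, x2) in feasible_points
                   for lam1 in multipliers)
```

The smallest gap on these grids is zero, attained at the primal solution $(1, 0)^T$ paired with the dual solution $\lambda_1 = 1$.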
For the remaining results, we note that the KKT conditions (12.34) specialized to
(12.81) are as follows:

$$ \nabla f(\bar{x}) - \nabla c(\bar{x}) \bar{\lambda} = 0, \tag{12.87a} $$

$$ c(\bar{x}) \ge 0, \tag{12.87b} $$

$$ \bar{\lambda} \ge 0, \tag{12.87c} $$

$$ \bar{\lambda}_i c_i(\bar{x}) = 0, \quad i = 1, 2, \ldots, m, \tag{12.87d} $$

where $\nabla c(x)$ is the $n \times m$ matrix defined by $\nabla c(x) = [\nabla c_1(x), \nabla c_2(x), \ldots, \nabla c_m(x)]$.


The next result shows that optimal Lagrange multipliers for (12.81) are solutions of the dual problem (12.84) under certain conditions. It is essentially due to Wolfe [309].

Theorem 12.12.

Suppose that $\bar{x}$ is a solution of (12.81) and that $f$ and $-c_i$, $i = 1, 2, \dots, m$ are convex functions on $\mathbb{R}^n$ that are differentiable at $\bar{x}$. Then any $\bar{\lambda}$ for which $(\bar{x}, \bar{\lambda})$ satisfies the KKT conditions (12.87) is a solution of (12.84).

Proof. Suppose that $(\bar{x}, \bar{\lambda})$ satisfies (12.87). We have from $\bar{\lambda} \ge 0$ that $\mathcal{L}(\cdot, \bar{\lambda})$ is a convex and differentiable function. Hence, for any $x$, we have

$$\mathcal{L}(x, \bar{\lambda}) \ge \mathcal{L}(\bar{x}, \bar{\lambda}) + \nabla_x \mathcal{L}(\bar{x}, \bar{\lambda})^T (x - \bar{x}) = \mathcal{L}(\bar{x}, \bar{\lambda}),$$

where the last equality follows from (12.87a). Therefore, we have

$$q(\bar{\lambda}) = \inf_x \mathcal{L}(x, \bar{\lambda}) = \mathcal{L}(\bar{x}, \bar{\lambda}) = f(\bar{x}) - \bar{\lambda}^T c(\bar{x}) = f(\bar{x}),$$

where the last equality follows from (12.87d). Since from Theorem 12.11 we have $q(\lambda) \le f(\bar{x})$ for all $\lambda \ge 0$, it follows immediately from $q(\bar{\lambda}) = f(\bar{x})$ that $\bar{\lambda}$ is a solution of (12.84). □

Note that if the functions are continuously differentiable and a constraint qualification such as LICQ holds at $\bar{x}$, then an optimal Lagrange multiplier is guaranteed to exist, by Theorem 12.1.

In Example 12.10, we see that $\lambda_1 = 1$ is both an optimal Lagrange multiplier for the problem (12.85) and a solution of (12.86). Note too that the optimal objective for both problems is 0.5.

We prove a partial converse of Theorem 12.12, which shows that solutions to the dual problem (12.84) can sometimes be used to derive solutions to the original problem (12.81). The essential condition is strict convexity of the function $\mathcal{L}(\cdot, \hat{\lambda})$ for a certain value $\hat{\lambda}$. We note that this condition holds if either $f$ is strictly convex (as is the case in Example 12.10) or if $-c_i$ is strictly convex for some $i = 1, 2, \dots, m$ with $\hat{\lambda}_i > 0$.
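Theorem 12.13.

Suppose that $f$ and $-c_i$, $i = 1, 2, \dots, m$ are convex and continuously differentiable on $\mathbb{R}^n$. Suppose that $\bar{x}$ is a solution of (12.81) at which LICQ holds. Suppose that $\hat{\lambda}$ solves (12.84) and that the infimum in $\inf_x \mathcal{L}(x, \hat{\lambda})$ is attained at $\hat{x}$. Assume further that $\mathcal{L}(\cdot, \hat{\lambda})$ is a strictly convex function. Then $\bar{x} = \hat{x}$ (that is, $\hat{x}$ is the unique solution of (12.81)), and $f(\bar{x}) = \mathcal{L}(\hat{x}, \hat{\lambda})$.

Proof. Assume for contradiction that $\bar{x} \ne \hat{x}$. From Theorem 12.1, because of the LICQ assumption, there exists $\bar{\lambda}$ satisfying (12.87). Hence, from Theorem 12.12, we have that $\bar{\lambda}$ also solves (12.84), so that

$$\mathcal{L}(\bar{x}, \bar{\lambda}) = q(\bar{\lambda}) = q(\hat{\lambda}) = \mathcal{L}(\hat{x}, \hat{\lambda}).$$

Because $\hat{x} = \arg\min_x \mathcal{L}(x, \hat{\lambda})$, we have from Theorem 2.2 that $\nabla_x \mathcal{L}(\hat{x}, \hat{\lambda}) = 0$. Moreover, by strict convexity of $\mathcal{L}(\cdot, \hat{\lambda})$, it follows that

$$\mathcal{L}(\bar{x}, \hat{\lambda}) - \mathcal{L}(\hat{x}, \hat{\lambda}) > \nabla_x \mathcal{L}(\hat{x}, \hat{\lambda})^T (\bar{x} - \hat{x}) = 0.$$

Hence, we have

$$\mathcal{L}(\bar{x}, \hat{\lambda}) > \mathcal{L}(\hat{x}, \hat{\lambda}) = \mathcal{L}(\bar{x}, \bar{\lambda}),$$

so in particular we have

$$-\hat{\lambda}^T c(\bar{x}) > -\bar{\lambda}^T c(\bar{x}) = 0,$$

where the final equality follows from (12.87d). Since $\hat{\lambda} \ge 0$ and $c(\bar{x}) \ge 0$, this yields the contradiction, and we conclude that $\hat{x} = \bar{x}$, as claimed. □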

In Example 12.10, at the dual solution $\lambda_1 = 1$, the infimum of $\mathcal{L}(x_1, x_2, \lambda_1)$ is achieved at $(x_1, x_2) = (1, 0)^T$, which is the solution of the original problem (12.85).

A slightly different form of duality that is convenient for computations, known as the Wolfe dual [309], can be stated as follows:

$$\max_{x, \lambda} \; \mathcal{L}(x, \lambda) \qquad (12.88a)$$

$$\text{subject to } \nabla_x \mathcal{L}(x, \lambda) = 0, \quad \lambda \ge 0. \qquad (12.88b)$$

The following result explains the relationship of the Wolfe dual to (12.81).

Theorem  12.14. 

Suppose that $f$ and $-c_i$, $i = 1, 2, \dots, m$ are convex and continuously differentiable on $\mathbb{R}^n$. Suppose that $(\bar{x}, \bar{\lambda})$ is a solution pair of (12.81) at which LICQ holds. Then $(\bar{x}, \bar{\lambda})$ solves the problem (12.88).

Proof. From the KKT conditions (12.87) we have that $(\bar{x}, \bar{\lambda})$ satisfies (12.88b), and that $\mathcal{L}(\bar{x}, \bar{\lambda}) = f(\bar{x})$. Therefore for any pair $(x, \lambda)$ that satisfies (12.88b) we have that

$$\begin{aligned}
\mathcal{L}(\bar{x}, \bar{\lambda}) &= f(\bar{x}) \\
&\ge f(\bar{x}) - \lambda^T c(\bar{x}) \\
&= \mathcal{L}(\bar{x}, \lambda) \\
&\ge \mathcal{L}(x, \lambda) + \nabla_x \mathcal{L}(x, \lambda)^T (\bar{x} - x) \\
&= \mathcal{L}(x, \lambda),
\end{aligned}$$

where the second inequality follows from the convexity of $\mathcal{L}(\cdot, \lambda)$. We have therefore shown that $(\bar{x}, \bar{\lambda})$ maximizes $\mathcal{L}$ over the constraints (12.88b), and hence solves (12.88). □
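As a worked illustration of the Wolfe dual, consider Example 12.10 again. This sketch assumes, as the expressions in that example suggest, that (12.85) is the problem of minimizing $0.5(x_1^2 + x_2^2)$ subject to $x_1 - 1 \ge 0$. The Wolfe dual (12.88) then reads

```latex
\begin{align*}
\max_{x,\,\lambda_1}\quad & 0.5\,(x_1^2 + x_2^2) - \lambda_1 (x_1 - 1) \\
\text{subject to}\quad & x_1 - \lambda_1 = 0, \qquad x_2 = 0, \qquad \lambda_1 \ge 0 .
\end{align*}
% Eliminating x through the equality constraints (x_1 = \lambda_1, x_2 = 0):
\begin{align*}
0.5\,\lambda_1^2 - \lambda_1(\lambda_1 - 1) = -0.5\,\lambda_1^2 + \lambda_1 .
\end{align*}
```

After the elimination, the Wolfe dual objective coincides with the dual objective in (12.86), so both forms yield $\lambda_1 = 1$ and the pair $(x, \lambda_1) = ((1, 0)^T, 1)$.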


□  Example  12.11  (Linear  Programming) 

An important special case of (12.81) is the linear programming problem

$$\min_x \; c^T x \quad \text{subject to } Ax - b \ge 0, \qquad (12.89)$$

for which the dual objective is

$$q(\lambda) = \inf_x \left[ c^T x - \lambda^T (Ax - b) \right] = \inf_x \left[ (c - A^T\lambda)^T x + b^T\lambda \right].$$

If $c - A^T\lambda \ne 0$, the infimum is clearly $-\infty$ (we can set $x$ to be a large positive multiple of $-(c - A^T\lambda)$ to make $q$ arbitrarily large and negative). When $c - A^T\lambda = 0$, on the other hand, the dual objective is simply $b^T\lambda$. In maximizing $q$, we can exclude $\lambda$ for which $c - A^T\lambda \ne 0$ from consideration (the maximum obviously cannot be attained at a point $\lambda$ for which $q(\lambda) = -\infty$). Hence, we can write the dual problem (12.84) as follows:

$$\max_\lambda \; b^T\lambda \quad \text{subject to } A^T\lambda = c, \; \lambda \ge 0. \qquad (12.90)$$

The Wolfe dual of (12.89) can be written as

$$\max_{\lambda, x} \; c^T x - \lambda^T (Ax - b) \quad \text{subject to } A^T\lambda = c, \; \lambda \ge 0,$$

and by substituting the constraint $A^T\lambda - c = 0$ into the objective we obtain (12.90) again.

For some matrices $A$, the dual problem (12.90) may be computationally easier to solve than the original problem (12.89). We discuss the possibilities further in Chapter 13.
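The primal-dual pair (12.89) and (12.90) can be illustrated on a tiny instance. The data below (`c`, `A`, `b`) and the candidate points `x` and `lam` are hypothetical values chosen for this sketch; they do not come from the text.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical instance of (12.89):
#   min x1 + 2*x2  subject to  x1 + x2 - 1 >= 0,  x1 >= 0,  x2 >= 0.
c = [1.0, 2.0]
A = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0, 0.0]

x = [1.0, 0.0]          # candidate primal point
lam = [1.0, 0.0, 1.0]   # candidate dual point

# Primal feasibility: A x - b >= 0.
primal_feasible = all(r - bi >= 0 for r, bi in zip(mat_vec(A, x), b))
# Dual feasibility for (12.90): A^T lam = c and lam >= 0.
dual_feasible = mat_vec(transpose(A), lam) == c and all(l >= 0 for l in lam)

primal_value = dot(c, x)    # c^T x
dual_value = dot(b, lam)    # b^T lam

print(primal_feasible, dual_feasible, primal_value, dual_value)
```

Because both points are feasible and the two objective values agree, weak duality (Theorem 12.11) implies that each point is optimal for its own problem.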
□ Example 12.12 (Convex Quadratic Programming)

Consider

$$\min_x \; \tfrac{1}{2} x^T G x + c^T x \quad \text{subject to } Ax - b \ge 0, \qquad (12.91)$$

where $G$ is a symmetric positive definite matrix. The dual objective for this problem is

$$q(\lambda) = \inf_x \mathcal{L}(x, \lambda) = \inf_x \; \tfrac{1}{2} x^T G x + c^T x - \lambda^T (Ax - b). \qquad (12.92)$$

Since $G$ is positive definite, $\mathcal{L}(\cdot, \lambda)$ is a strictly convex quadratic function, so the infimum is achieved when $\nabla_x \mathcal{L}(x, \lambda) = 0$, that is,

$$Gx + c - A^T\lambda = 0. \qquad (12.93)$$

Hence, we can substitute for $x$ in the infimum expression and write the dual objective explicitly as follows:

$$q(\lambda) = -\tfrac{1}{2} (A^T\lambda - c)^T G^{-1} (A^T\lambda - c) + b^T\lambda.$$

Alternatively, we can write the Wolfe dual form (12.88) by retaining $x$ as a variable and including the constraint (12.93) explicitly in the dual problem, to obtain

$$\max_{(\lambda, x)} \; \tfrac{1}{2} x^T G x + c^T x - \lambda^T (Ax - b) \qquad (12.94)$$

$$\text{subject to } Gx + c - A^T\lambda = 0, \quad \lambda \ge 0.$$

To make it clearer that the objective is concave, we can use the constraint to substitute $(c - A^T\lambda)^T x = -x^T G x$ in the objective, and rewrite the dual formulation as follows:

$$\max_{(\lambda, x)} \; -\tfrac{1}{2} x^T G x + \lambda^T b, \quad \text{subject to } Gx + c - A^T\lambda = 0, \; \lambda \ge 0. \qquad (12.95)$$

□

Note that the Wolfe dual form requires only positive semidefiniteness of $G$.
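The explicit dual objective for (12.91) is easy to evaluate when $G$ is diagonal. The sketch below makes that simplifying assumption (a general $G$ would need a linear solve), and the helper names and data are illustrative. With $G = I$, $c = 0$, $A = [1 \;\; 0]$, and $b = 1$, the problem is the one assumed for Example 12.10, so the values should reproduce $-0.5\lambda_1^2 + \lambda_1$.

```python
def qp_dual_objective(lam, G_diag, c, A, b):
    """q(lam) = -1/2 (A^T lam - c)^T G^{-1} (A^T lam - c) + b^T lam, for diagonal G."""
    n, m = len(c), len(lam)
    # r = A^T lam - c
    r = [sum(A[i][j] * lam[i] for i in range(m)) - c[j] for j in range(n)]
    quad = sum(rj * rj / gj for rj, gj in zip(r, G_diag))
    return -0.5 * quad + sum(bi * li for bi, li in zip(b, lam))

def primal_minimizer(lam, G_diag, c, A):
    """Solve (12.93), G x + c - A^T lam = 0, for x when G is diagonal."""
    n, m = len(c), len(lam)
    return [(sum(A[i][j] * lam[i] for i in range(m)) - c[j]) / G_diag[j]
            for j in range(n)]

# Illustrative data: G = I, c = 0, single constraint x1 - 1 >= 0.
G_diag = [1.0, 1.0]
c = [0.0, 0.0]
A = [[1.0, 0.0]]
b = [1.0]

values = {lam1: qp_dual_objective([lam1], G_diag, c, A, b)
          for lam1 in (0.0, 0.5, 1.0, 2.0)}
x_at_dual_solution = primal_minimizer([1.0], G_diag, c, A)

print(values, x_at_dual_solution)
```

At $\lambda_1 = 1$ the recovered point is $(1, 0)^T$, in agreement with the remark following Theorem 12.13.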

NOTES  AND  REFERENCES 

The  theory  of  constrained  optimization  is  discussed  in  many  books  on  numerical 
optimization.  The  discussion  in  Fletcher  [101,  Chapter  9]  is  similar  to  ours,  though  a  little 
terser, and includes additional material on duality. Bertsekas [19, Chapter 3] emphasizes the
role  of  duality  and  discusses  sensitivity  of  the  solution  with  respect  to  the  active  constraints 
in  some  detail.  The  classic  treatment  of  Mangasarian  [198]  is  particularly  notable  for  its 
thorough  description  of  constraint  qualifications.  It  also  has  an  extensive  discussion  of 
theorems  of  the  alternative  [198,  Chapter  2],  placing  Farkas’  Lemma  firmly  in  the  context 
of  other  related  results. 

The  KKT  conditions  were  described  in  a  1951  paper  of  Kuhn  and  Tucker  [185], 
though  they  were  derived  earlier  (and  independently)  in  an  unpublished  1939  master’s 
thesis  of  W.  Karush.  Lagrange  multipliers  and  optimality  conditions  for  general  problems 
(including  nonsmooth  problems)  are  described  in  the  deep  and  wide-ranging  article  of 
Rockafellar  [270]. 

Duality theory for nonlinear programming is described in the books of Rockafellar [198] and Bertsekas [19]; the latter treatment is particularly extensive and general. The material in Section 12.9 is adapted from these sources.

We  return  to  our  claim  that  the  set  N  defined  by 

$$N = \{ By + Ct \mid y \ge 0 \},$$

(where  B  and  C  are  matrices  of  dimension  n  x  m  and  n  x  p,  respectively,  and  y  and  t  are 
vectors  of  appropriate  dimensions;  see  (12.45))  is  a  closed  set.  This  fact  is  needed  in  the 
proof  of  Lemma  12.4  to  ensure  that  the  solution  of  the  projection  subproblem  (12.47)  is 
well-defined.  The  following  technical  result  is  well  known;  the  proof  given  below  is  due  to 
R.  Byrd. 

Lemma  12.15. 

The  set  N  is  closed. 

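Proof. By splitting $t$ into positive and negative parts, it is easy to see that

$$N = \left\{ \begin{bmatrix} B & C & -C \end{bmatrix} \begin{bmatrix} y \\ t^+ \\ t^- \end{bmatrix} \;\middle|\; \begin{bmatrix} y \\ t^+ \\ t^- \end{bmatrix} \ge 0 \right\}.$$

Hence, we can assume without loss of generality that $N$ has the form

$$N = \{ By \mid y \ge 0 \}.$$

Suppose that $B$ has dimensions $n \times m$.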

First, we show that for any $s \in N$, we can write $s = B_I y_I$ with $y_I \ge 0$, where $I \subset \{1, 2, \dots, m\}$, $B_I$ is the column submatrix of $B$ indexed by $I$ with full column rank, and $I$ has minimum cardinality. To prove this claim, we assume for contradiction that $K \subset \{1, 2, \dots, m\}$ is an index set with minimal cardinality such that $s = B_K y_K$, $y_K \ge 0$, yet the columns of $B_K$ are linearly dependent. Since $K$ is minimal, $y_K$ has no zero components. We then have a nonzero vector $w$ such that $B_K w = 0$. Since $s = B_K (y_K + \tau w)$ for any $\tau$, we can increase or decrease $\tau$ from 0 until one or more components of $y_K + \tau w$ become zero, while the other components remain positive. We define $\bar{K}$ by removing the indices from $K$ that correspond to zero components of $y_K + \tau w$, and define $\bar{y}_{\bar{K}}$ to be the vector of strictly positive components of $y_K + \tau w$. We then have that $s = B_{\bar{K}} \bar{y}_{\bar{K}}$ and $\bar{y}_{\bar{K}} \ge 0$, contradicting our assumption that $K$ was the set of minimal cardinality with this property.

Now let $\{s^k\}$ be a sequence with $s^k \in N$ for all $k$ and $s^k \to s$. We prove the lemma by showing that $s \in N$. By the claim of the previous paragraph, for all $k$ we can write $s^k = B_{I_k} y^k_{I_k}$ with $y^k_{I_k} \ge 0$, where $I_k$ is minimal and the columns of $B_{I_k}$ are linearly independent. Since there are only finitely many possible choices of index set $I_k$, at least one index set occurs infinitely often in the sequence. By choosing such an index set $I$, we can take a subsequence if necessary and assume without loss of generality that $I_k = I$ for all $k$. We then have that $s^k = B_I y^k_I$ with $y^k_I \ge 0$, and $B_I$ has full column rank. Because of the latter property, we have that $B_I^T B_I$ is invertible, so that $y^k_I$ is defined uniquely as follows:

$$y^k_I = (B_I^T B_I)^{-1} B_I^T s^k, \quad k = 0, 1, 2, \dots.$$

By taking limits and using $s^k \to s$, we have that

$$y^k_I \to y_I = (B_I^T B_I)^{-1} B_I^T s,$$

and moreover $y_I \ge 0$, since $y^k_I \ge 0$ for all $k$. Hence we can write $s = B_I y_I$ with $y_I \ge 0$, and therefore $s \in N$. □
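The support-reduction step in the first part of the proof can be sketched in a few lines. The function name `reduce_support` and the small matrix `B` below are illustrative assumptions; the null vector `w` is supplied by hand rather than computed.

```python
def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def reduce_support(y, w):
    """Given y > 0 and a nonzero w with B w = 0, return y + tau*w >= 0
    in which at least one component is exactly zero."""
    # Positive step lengths at which a component with w_i < 0 reaches zero.
    hits = [-yi / wi for yi, wi in zip(y, w) if wi < 0]
    if hits:
        tau = min(hits)      # increase tau from 0 until the first component hits zero
    else:
        # No negative entries in w: decrease tau from 0 instead.
        tau = max(-yi / wi for yi, wi in zip(y, w) if wi > 0)
    return [yi + tau * wi for yi, wi in zip(y, w)]

# Illustrative data: the three columns of B are linearly dependent.
B = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 1.0, 1.0]
w = [1.0, 1.0, -1.0]     # B w = 0

y_new = reduce_support(y, w)
print(y_new, mat_vec(B, y), mat_vec(B, y_new))
```

The point $s = By$ is unchanged, but it is now represented with only two columns of $B$, which is the contradiction used in the proof when $K$ was assumed minimal.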


Exercises

12.1 The following example from [268] with a single variable $x \in \mathbb{R}$ and a single equality constraint shows that strict local solutions are not necessarily isolated. Consider

$$\min_x \; x^2 \quad \text{subject to } c(x) = 0, \quad \text{where } c(x) = \begin{cases} x^6 \sin(1/x) & \text{if } x \ne 0, \\ 0 & \text{if } x = 0. \end{cases} \qquad (12.96)$$

(a) Show that the constraint function is twice continuously differentiable at all $x$ (including at $x = 0$) and that the feasible points are $x = 0$ and $x = 1/(k\pi)$ for all nonzero integers $k$.

(b) Verify that each feasible point except $x = 0$ is an isolated local solution by showing that there is a neighborhood $\mathcal{N}$ around each such point within which it is the only feasible point.

(c) Verify that $x = 0$ is a global solution and a strict local solution, but not an isolated local solution.

12.2 Is an isolated local solution necessarily a strict local solution? Explain.

12.3 Does problem (12.4) have a finite or infinite number of local solutions? Use the first-order optimality conditions (12.34) to justify your answer.

12.4 If $f$ is convex and the feasible region $\Omega$ is convex, show that local solutions of the problem (12.3) are also global solutions. Show that the set of global solutions is convex. (Hint: See Theorem 2.5.)

12.5 Let $v : \mathbb{R}^n \to \mathbb{R}^m$ be a smooth vector function and consider the unconstrained optimization problems of minimizing $f(x)$ where

$$f(x) = \|v(x)\|_\infty, \qquad f(x) = \max_{i = 1, 2, \dots, m} v_i(x).$$

Reformulate these (generally nonsmooth) problems as smooth constrained optimization problems.

12.6 Can you perform a smooth reformulation as in the previous question when $f$ is defined by

$$f(x) = \min_i \, f_i(x)?$$

(N.B. “min” not “max.”) Why or why not?

12.7 Show that the vector defined by (12.15) satisfies (12.14) when the first-order optimality condition (12.10) is not satisfied.

12.8 Verify that for the sequence $\{z_k\}$ defined by (12.30), the function $f(x) = x_1 + x_2$ satisfies $f(z_{k+1}) > f(z_k)$ for $k = 2, 3, \dots$. (Hint: Consider the trajectory $z(s) = (-\sqrt{1 - 1/s^2}, \; -1/s)^T$ and show that the function $h(s) = f(z(s))$ has $h'(s) > 0$ for all $s \ge 2$.)

12.9 Consider the problem (12.9). Specify two feasible sequences that approach the maximizing point $(1, 1)^T$, and show that neither sequence is a decreasing sequence for $f$.

12.10 Verify that neither the LICQ nor the MFCQ holds for the constraint set defined by (12.32) at $x^* = (0, 0)^T$.

12.11 Consider the feasible set $\Omega$ in $\mathbb{R}^2$ defined by $x_2 \ge 0$, $x_2 \le x_1^2$.

(a) For $x^* = (0, 0)^T$, write down $T_\Omega(x^*)$ and $\mathcal{F}(x^*)$.

(b) Is LICQ satisfied at $x^*$? Is MFCQ satisfied?

(c) If the objective function is $f(x) = -x_2$, verify that the KKT conditions (12.34) are satisfied at $x^*$.

(d) Find a feasible sequence $\{z_k\}$ approaching $x^*$ with $f(z_k) < f(x^*)$ for all $k$.

12.12 It is trivial to construct an example of a feasible set and a feasible point $x^*$ at which the LICQ is satisfied but the constraints are nonlinear. Give an example of the reverse situation, that is, where the active constraints are linear but the LICQ is not satisfied.

12.13 Show that for the feasible region defined by

$$(x_1 - 1)^2 + (x_2 - 1)^2 \le 2,$$

$$(x_1 - 1)^2 + (x_2 + 1)^2 \le 2,$$

$$x_1 \ge 0,$$

the MFCQ is satisfied at $x^* = (0, 0)^T$ but the LICQ is not satisfied.

12.14 Consider the half space defined by $H = \{x \in \mathbb{R}^n \mid a^T x + \alpha \ge 0\}$ where $a \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$ are given. Formulate and solve the optimization problem for finding the point $x$ in $H$ that has the smallest Euclidean norm.

12.15 Consider the following modification of (12.36), where $t$ is a parameter to be fixed prior to solving the problem:

$$\min_x \; \left(x_1 - \tfrac{3}{2}\right)^2 + (x_2 - t)^4 \quad \text{s.t.} \quad \begin{bmatrix} 1 - x_1 - x_2 \\ 1 - x_1 + x_2 \\ 1 + x_1 - x_2 \\ 1 + x_1 + x_2 \end{bmatrix} \ge 0. \qquad (12.97)$$

(a) For what values of $t$ does the point $x^* = (1, 0)^T$ satisfy the KKT conditions?

(b) Show that when $t = 1$, only the first constraint is active at the solution, and find the solution.

12.16 (Fletcher [101]) Solve the problem

$$\min_x \; x_1 + x_2 \quad \text{subject to } x_1^2 + x_2^2 = 1$$

by eliminating the variable $x_2$. Show that the choice of sign for a square root operation during the elimination process is critical; the “wrong” choice leads to an incorrect answer.

12.17 Prove that when the KKT conditions (12.34) and the LICQ are satisfied at a point $x^*$, the Lagrange multiplier $\lambda^*$ in (12.34) is unique.
12.18 Consider the problem of finding the point on the parabola $y = \tfrac{1}{5}(x - 1)^2$ that is closest to $(x, y) = (1, 2)$, in the Euclidean norm sense. We can formulate this problem as

$$\min \; f(x, y) = (x - 1)^2 + (y - 2)^2 \quad \text{subject to } (x - 1)^2 = 5y.$$

(a) Find all the KKT points for this problem. Is the LICQ satisfied?

(b) Which of these points are solutions?

(c) By directly substituting the constraint into the objective function and eliminating the variable $x$, we obtain an unconstrained optimization problem. Show that the solutions of this problem cannot be solutions of the original problem.

12.19 Consider the problem

$$\min_{x \in \mathbb{R}^2} \; f(x) = -2x_1 + x_2 \quad \text{subject to } \begin{cases} (1 - x_1)^3 - x_2 \ge 0, \\ x_2 + 0.25 x_1^2 - 1 \ge 0. \end{cases}$$

The optimal solution is $x^* = (0, 1)^T$, where both constraints are active.

(a) Does the LICQ hold at this point?

(b) Are the KKT conditions satisfied?

(c) Write down the sets $\mathcal{F}(x^*)$ and $\mathcal{C}(x^*, \lambda^*)$.

(d) Are the second-order necessary conditions satisfied? Are the second-order sufficient conditions satisfied?

12.20 Find the minima of the function $f(x) = x_1 x_2$ on the unit circle $x_1^2 + x_2^2 = 1$. Illustrate this problem geometrically.

12.21 Find the maxima of $f(x) = x_1 x_2$ over the unit disk defined by the inequality constraint $1 - x_1^2 - x_2^2 \ge 0$.

12.22 Show that for (12.1), the feasible set $\Omega$ is convex if $c_i$, $i \in \mathcal{E}$ are linear functions and $-c_i$, $i \in \mathcal{I}$ are convex functions.


Chapter 13

Linear Programming: The Simplex Method


Dantzig’s development of the simplex method in the late 1940s marks the start of the modern
era  in  optimization.  This  method  made  it  possible  for  economists  to  formulate  large  models 
and  analyze  them  in  a  systematic  and  efficient  way.  Dantzig’s  discovery  coincided  with  the 
development  of  the  first  electronic  computers,  and  the  simplex  method  became  one  of  the 
earliest  important  applications  of  this  new  and  revolutionary  technology.  From  those  days 
to  the  present,  computer  implementations  of  the  simplex  method  have  been  continually 
improved  and  refined.  They  have  benefited  particularly  from  interactions  with  numerical 
analysis,  a  branch  of  mathematics  that  also  came  into  its  own  with  the  appearance  of 
electronic  computers,  and  have  now  reached  a  high  level  of  sophistication. 
Today, linear programming and the simplex method continue to hold sway as the most
widely  used  of  all  optimization  tools.  Since  1950,  generations  of  workers  in  management, 
economics,  finance,  and  engineering  have  been  trained  in  the  techniques  of  formulating 
linear  models  and  solving  them  with  simplex-based  software.  Often,  the  situations  they 
model  are  actually  nonlinear,  but  linear  programming  is  appealing  because  of  the  advanced 
state  of  the  software,  guaranteed  convergence  to  a  global  minimum,  and  the  fact  that 
uncertainty  in  the  model  makes  a  linear  model  more  appropriate  than  an  overly  complex 
nonlinear  model.  Nonlinear  programming  may  replace  linear  programming  as  the  method 
of  choice  in  some  applications  as  the  nonlinear  software  improves,  and  a  new  class  of 
methods  known  as  interior-point  methods  (see  Chapter  14)  has  proved  to  be  faster  for 
some  linear  programming  problems,  but  the  continued  importance  of  the  simplex  method 
is  assured  for  the  foreseeable  future. 

LINEAR  PROGRAMMING 

Linear  programs  have  a  linear  objective  function  and  linear  constraints,  which  may 
include  both  equalities  and  inequalities.  The  feasible  set  is  a  polytope,  a  convex,  connected 
set  with  flat,  polygonal  faces.  The  contours  of  the  objective  function  are  planar.  Figure  13.1 
depicts  a  linear  program  in  two-dimensional  space,  in  which  the  contours  of  the  objective 
function  are  indicated  by  dotted  lines.  The  solution  in  this  case  is  unique — a  single  vertex. 
A  simple  reorientation  of  the  polytope  or  the  objective  gradient  c  could  however  make  the 
solution  non-unique;  the  optimal  value  cT  x  could  take  on  the  same  value  over  an  entire 
edge.  In  higher  dimensions,  the  set  of  optimal  points  can  be  a  single  vertex,  an  edge  or  face, 
or  even  the  entire  feasible  set.  The  problem  has  no  solution  if  the  feasible  set  is  empty  (the 
infeasible  case)  or  if  the  objective  function  is  unbounded  below  on  the  feasible  region  (the 
unbounded  case). 

Linear programs are usually stated and analyzed in the following standard form:

$$\min \; c^T x, \quad \text{subject to } Ax = b, \; x \ge 0, \qquad (13.1)$$

where $c$ and $x$ are vectors in $\mathbb{R}^n$, $b$ is a vector in $\mathbb{R}^m$, and $A$ is an $m \times n$ matrix. Simple devices can be used to transform any linear program to this form. For instance, given the problem

$$\min \; c^T x, \quad \text{subject to } Ax \le b$$

(without any bounds on $x$), we can convert the inequality constraints to equalities by introducing a vector of slack variables $z$ and writing

$$\min \; c^T x, \quad \text{subject to } Ax + z = b, \; z \ge 0. \qquad (13.2)$$


This form is still not quite standard, since not all the variables are constrained to be nonnegative. We deal with this by splitting $x$ into its nonnegative and nonpositive parts, $x = x^+ - x^-$, where $x^+ = \max(x, 0) \ge 0$ and $x^- = \max(-x, 0) \ge 0$. The problem (13.2) can now be written as

$$\min \; \begin{bmatrix} c \\ -c \\ 0 \end{bmatrix}^T \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix}, \quad \text{s.t.} \quad \begin{bmatrix} A & -A & I \end{bmatrix} \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix} = b, \quad \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix} \ge 0,$$

which clearly has the same form as (13.1).
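The conversion just described is mechanical, and a small sketch may make the block structure concrete. The function names `to_standard_form` and `split_point` and the sample data are hypothetical; matrices are stored as plain lists of rows.

```python
def to_standard_form(c, A, b):
    """Convert  min c^T x  s.t.  A x <= b  (x free) into
    min c_std^T v  s.t.  A_std v = b, v >= 0,  with v = (x_plus, x_minus, z)."""
    m = len(A)
    # Cost vector [c; -c; 0].
    c_std = list(c) + [-cj for cj in c] + [0.0] * m
    A_std = []
    for i in range(m):
        identity_row = [1.0 if k == i else 0.0 for k in range(m)]
        # Row i of [A  -A  I].
        A_std.append(list(A[i]) + [-a for a in A[i]] + identity_row)
    return c_std, A_std, list(b)

def split_point(x, A, b):
    """Map a point x with A x <= b to the nonnegative vector (x_plus, x_minus, z)."""
    x_plus = [max(xj, 0.0) for xj in x]
    x_minus = [max(-xj, 0.0) for xj in x]
    z = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return x_plus + x_minus + z

# Illustrative data: one inequality, two free variables.
c = [1.0, -1.0]
A = [[1.0, 2.0]]
b = [4.0]
x = [-1.0, 2.0]   # satisfies A x = 3 <= 4

c_std, A_std, b_std = to_standard_form(c, A, b)
v = split_point(x, A, b)
lhs = [sum(a * vj for a, vj in zip(row, v)) for row in A_std]
objective_std = sum(cj * vj for cj, vj in zip(c_std, v))

print(A_std, v, lhs, objective_std)
```

The transformed point is nonnegative, satisfies the equality constraints, and has the same objective value as the original point, which is what equivalence of the two formulations requires.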

Inequality constraints of the form $x \le u$ or $Ax \ge b$ can always be converted to equality constraints by adding or subtracting slack variables to make up the difference between the left- and right-hand sides. Hence,

$$x \le u \;\Leftrightarrow\; x + w = u, \; w \ge 0,$$

$$Ax \ge b \;\Leftrightarrow\; Ax - y = b, \; y \ge 0.$$

(When we subtract the variables from the left-hand side, as in the second case, they are sometimes known as surplus variables.) We can also convert a “maximize” objective $\max c^T x$ into the “minimize” form of (13.1) by simply negating $c$ to obtain $\min (-c)^T x$.

We say that the linear program (13.1) is infeasible if the feasible set is empty. We say that the problem (13.1) is unbounded if the objective function is unbounded below on the feasible region, that is, there is a sequence of points $x_k$ feasible for (13.1) such that $c^T x_k \downarrow -\infty$. (Of course, unbounded problems have no solution.)
Many linear programs arise from models of transshipment and distribution networks. These problems have additional structure in their constraints; special-purpose simplex algorithms that exploit this structure are highly efficient. We do not discuss such problems further in this book, except to note that the subject is important and complex, and that a number of fine texts on the topic are available (see, for example, Ahuja, Magnanti, and Orlin [1]).

For the standard formulation (13.1), we will assume throughout that $m < n$. Otherwise, the system $Ax = b$ contains redundant rows, or is infeasible, or defines a unique point. When $m \ge n$, factorizations such as the QR or LU factorization (see Appendix A) can be used to transform the system $Ax = b$ to one with a coefficient matrix of full row rank.


13.1 OPTIMALITY AND DUALITY

OPTIMALITY  CONDITIONS 

Optimality conditions for the problem (13.1) can be derived from the theory of Chapter 12. Only the first-order conditions, the Karush-Kuhn-Tucker (KKT) conditions, are needed. Convexity of the problem ensures that these conditions are sufficient for a global minimum. We do not need to refer to the second-order conditions from Chapter 12, which are not informative in any case because the Hessian of the Lagrangian for (13.1) is zero.

The theory we developed in Chapter 12 makes derivation of optimality and duality results for linear programming much easier than in other treatments, where this theory is developed more or less from scratch.

The KKT conditions follow from Theorem 12.1. As stated in Chapter 12, this theorem requires linear independence of the active constraint gradients (LICQ). However, as we noted in Section 12.6, the result continues to hold for dependent constraints provided they are linear, as is the case here.

We partition the Lagrange multipliers for the problem (13.1) into two vectors $\lambda$ and $s$, where $\lambda \in \mathbb{R}^m$ is the multiplier vector for the equality constraints $Ax = b$, while $s \in \mathbb{R}^n$ is the multiplier vector for the bound constraints $x \ge 0$. Using the definition (12.33), we can write the Lagrangian function for (13.1) as

$$\mathcal{L}(x, \lambda, s) = c^T x - \lambda^T (Ax - b) - s^T x. \qquad (13.3)$$

Applying Theorem 12.1, we find that the first-order necessary conditions for $x^*$ to be a solution of (13.1) are that there exist vectors $\lambda$ and $s$ such that

$$A^T \lambda + s = c, \qquad (13.4a)$$

$$Ax = b, \qquad (13.4b)$$

$$x \ge 0, \qquad (13.4c)$$

$$s \ge 0, \qquad (13.4d)$$

$$x_i s_i = 0, \quad i = 1, 2, \dots, n. \qquad (13.4e)$$
The complementarity condition (13.4e), which essentially says that at least one of the components $x_i$ and $s_i$ must be zero for each $i = 1, 2, \dots, n$, is often written in the alternative form $x^T s = 0$. Because of the nonnegativity conditions (13.4c), (13.4d), the two forms are identical.

Let $(x^*, \lambda^*, s^*)$ denote a vector triple that satisfies (13.4). By combining the three equalities (13.4a), (13.4b), and (13.4e), we find that

$$c^T x^* = (A^T \lambda^* + s^*)^T x^* = (Ax^*)^T \lambda^* = b^T \lambda^*. \qquad (13.5)$$

As we shall see in a moment, $b^T \lambda$ is the objective function for the dual problem to (13.1), so (13.5) indicates that the primal and dual objectives are equal for vector triples $(x, \lambda, s)$ that satisfy (13.4).

It is easy to show directly that the conditions (13.4) are sufficient for $x^*$ to be a global solution of (13.1). Let $\bar{x}$ be any other feasible point, so that $A\bar{x} = b$ and $\bar{x} \ge 0$. Then

$$c^T \bar{x} = (A^T \lambda^* + s^*)^T \bar{x} = b^T \lambda^* + \bar{x}^T s^* \ge b^T \lambda^* = c^T x^*. \qquad (13.6)$$

We have used (13.4) and (13.5) here; the inequality relation follows trivially from $\bar{x} \ge 0$ and $s^* \ge 0$. The inequality (13.6) tells us that no other feasible point can have a lower objective value than $c^T x^*$. We can say more: The feasible point $\bar{x}$ is optimal if and only if

$$\bar{x}^T s^* = 0,$$

since otherwise the inequality in (13.6) is strict. In other words, when $s_i^* > 0$, then we must have $\bar{x}_i = 0$ for all solutions $\bar{x}$ of (13.1).
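The conditions (13.4) are simple to verify for a given triple. The sketch below checks them on a hypothetical two-variable problem; the function name `kkt_residuals` and the data `c`, `A`, `b`, `x`, `lam`, `s` are assumptions made for illustration.

```python
def kkt_residuals(c, A, b, x, lam, s):
    """Return residuals of (13.4a), (13.4b) and the products x_i * s_i from (13.4e)."""
    n, m = len(c), len(b)
    dual_eq = [sum(A[i][j] * lam[i] for i in range(m)) + s[j] - c[j] for j in range(n)]
    primal_eq = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    complementarity = [x[j] * s[j] for j in range(n)]
    return dual_eq, primal_eq, complementarity

# Hypothetical standard-form problem: min x1 + 2*x2  s.t.  x1 + x2 = 1,  x >= 0.
c = [1.0, 2.0]
A = [[1.0, 1.0]]
b = [1.0]

# Candidate triple (x, lam, s).
x = [1.0, 0.0]
lam = [1.0]
s = [0.0, 1.0]

dual_eq, primal_eq, complementarity = kkt_residuals(c, A, b, x, lam, s)
kkt_satisfied = (
    all(r == 0 for r in dual_eq)
    and all(r == 0 for r in primal_eq)
    and all(xj >= 0 for xj in x)          # (13.4c)
    and all(sj >= 0 for sj in s)          # (13.4d)
    and all(p == 0 for p in complementarity)
)
primal_value = sum(cj * xj for cj, xj in zip(c, x))   # c^T x
dual_value = sum(bi * li for bi, li in zip(b, lam))   # b^T lam

print(kkt_satisfied, primal_value, dual_value)
```

Consistent with (13.5), the two objective values agree. Note also that $s_2 = 1 > 0$ forces $x_2 = 0$ at any solution, as the last remark above states.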

THE  DUAL  PROBLEM 

Given the data $c$, $b$, and $A$, which define the problem (13.1), we can define another, closely related, problem as follows:

$$\max_\lambda \; b^T \lambda, \quad \text{subject to } A^T \lambda \le c. \qquad (13.7)$$

This problem is called the dual problem for (13.1). In contrast, (13.1) is often referred to as the primal. We can restate (13.7) in a slightly different form by introducing a vector of dual slack variables $s$, and writing

$$\max_\lambda \; b^T \lambda, \quad \text{subject to } A^T \lambda + s = c, \; s \ge 0. \qquad (13.8)$$

The variables $(\lambda, s)$ in this problem are sometimes referred to collectively as dual variables.
The primal and dual problems present two different viewpoints on the same data. Their close relationship becomes evident when we write down the KKT conditions for (13.7). Let us first restate (13.7) in the form

$$\min_\lambda \; -b^T \lambda \quad \text{subject to } c - A^T \lambda \ge 0,$$

to fit the formulation (12.1) from Chapter 12. By using $x \in \mathbb{R}^n$ to denote the Lagrange multipliers for the constraints $A^T \lambda \le c$, we see that the Lagrangian function is

$$\bar{\mathcal{L}}(\lambda, x) = -b^T \lambda - x^T (c - A^T \lambda).$$

Using Theorem 12.1 again, we find the first-order necessary conditions for $\lambda$ to be optimal for (13.7) to be that there exists $x$ such that

$$Ax = b, \qquad (13.9a)$$

$$A^T \lambda \le c, \qquad (13.9b)$$

$$x \ge 0, \qquad (13.9c)$$

$$x_i (c - A^T \lambda)_i = 0, \quad i = 1, 2, \dots, n. \qquad (13.9d)$$

Defining $s = c - A^T \lambda$ (as in (13.8)), we find that the conditions (13.9) and (13.4) are identical! The optimal Lagrange multipliers $\lambda$ in the primal problem are the optimal variables in the dual problem, while the optimal Lagrange multipliers $x$ in the dual problem are the optimal variables in the primal problem.

Analogously  to  (13.6),  we  can  show  that  (13.9)  are  in  fact  sufficient  conditions  for  a 
solution  of  the  dual  problem  (13.7).  Given  x*  and  X *  satisfying  these  conditions  (so  that  the 
triple  ( x ,  X ,  s)  —  (x*,  X*,  c  —  ATX*)  satisfies  (13.4)),  we  have  for  any  other  dual  feasible 
point  X  (with  ATX  <  c)  that 

bTX  =  ( x*)tAtX 

=  (x*)T  {AtX  —  c)  +  cTx* 

<  cT x*  because  ATX  —  c  <  0  andx*  >  0 

—  bTX*  from  (13.5). 

Hence  X*  achieves  the  maximum  of  the  dual  objective  bT X  over  the  dual  feasible  region 
AT X  <  c,  so  it  solves  the  dual  problem  (13.7). 

The primal-dual relationship is symmetric; by taking the dual of the dual problem
(13.7), we recover the primal problem (13.1). We leave the proof of this claim as an exercise.

Given a feasible vector $x$ for the primal (satisfying $Ax = b$ and $x \ge 0$) and a feasible
point $(\lambda, s)$ for the dual (satisfying $A^T \lambda + s = c$, $s \ge 0$), we have as in (13.6) that

$$c^T x - b^T \lambda = (c - A^T \lambda)^T x = s^T x \ge 0. \tag{13.10}$$

Therefore we have $c^T x \ge b^T \lambda$ (that is, the dual objective is a lower bound on the primal
objective) when both the primal and dual variables are feasible, a result known as weak
duality.
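
As a small numerical illustration of (13.10), the following Python sketch evaluates the duality gap for one primal feasible point and one dual feasible point. The problem data, the two feasible points, and the helper names `dot` and `At_times` are assumptions chosen for illustration; they are not taken from the text. Exact rational arithmetic is used so that the two sides of (13.10) can be compared without rounding error.

```python
from fractions import Fraction as F

# Illustrative data (an assumption): min c^T x  subject to  Ax = b, x >= 0.
A = [[F(1), F(1), F(1), F(0)],
     [F(2), F(1, 2), F(0), F(1)]]
b = [F(5), F(8)]
c = [F(-3), F(-2), F(0), F(0)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def At_times(lam):
    """Return the vector A^T lam."""
    return [dot([row[j] for row in A], lam) for j in range(len(A[0]))]

x = [F(1), F(2), F(2), F(5)]     # primal feasible: Ax = b, x >= 0
lam = [F(-3), F(-1)]             # dual feasible: A^T lam <= c
s = [cj - aj for cj, aj in zip(c, At_times(lam))]   # dual slacks s = c - A^T lam

# Check feasibility of both points before comparing objectives.
assert [dot(row, x) for row in A] == b and min(x) >= 0
assert min(s) >= 0

gap = dot(c, x) - dot(b, lam)    # c^T x - b^T lam
print(gap, dot(s, x))            # the two numbers agree and are nonnegative
```

Neither point is optimal here, so the gap is strictly positive; it shrinks to zero only at a primal-dual solution.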

The  following  strong  duality  result  is  fundamental  to  the  theory  of  linear  programming. 

Theorem  13.1  (Strong  Duality). 

(i)  If  either  problem  (13.1)  or  (13.7)  has  a  (finite)  solution,  then  so  does  the  other,  and  the 
objective  values  are  equal. 

(ii) If either problem (13.1) or (13.7) is unbounded, then the other problem is infeasible.

Proof. For (i), suppose that (13.1) has a finite optimal solution $x^*$. It follows from
Theorem 12.1 that there are vectors $\lambda^*$ and $s^*$ such that $(x^*, \lambda^*, s^*)$ satisfies (13.4). We noted
above that (13.4) and (13.9) are equivalent, and that (13.9) are sufficient conditions for $\lambda^*$ to
be a solution of the dual problem (13.7). Moreover, it follows from (13.5) that $c^T x^* = b^T \lambda^*$,
as claimed.

A symmetric argument holds if we start by assuming that the dual problem (13.7) has
a solution.

To prove (ii), suppose that the primal is unbounded, that is, there is a sequence of
points $x_k$, $k = 1, 2, 3, \ldots$ such that

$$c^T x_k \downarrow -\infty, \quad A x_k = b, \quad x_k \ge 0.$$

Suppose too that the dual (13.7) is feasible, that is, there exists a vector $\lambda$ such that $A^T \lambda \le c$.
From the latter inequality together with $x_k \ge 0$, we have that $\lambda^T A x_k \le c^T x_k$, and therefore

$$\lambda^T b = \lambda^T A x_k \le c^T x_k \downarrow -\infty,$$

yielding a contradiction. Hence, the dual must be infeasible.

A  similar  argument  can  be  used  to  show  that  unboundedness  of  the  dual  implies 
infeasibility  of  the  primal.  □ 

As we showed in the discussion following Theorem 12.1, the multiplier values $\lambda$
and $s$ for (13.1) indicate the sensitivity of the optimal objective value to perturbations in
the constraints. In fact, the process of finding $(\lambda, s)$ for a given optimal $x$ is often called
sensitivity analysis. Considering the case of perturbations to the vector $b$ (the right-hand side
in (13.1) and objective gradient in (13.7)), we can make an informal argument to illustrate
the sensitivity. Suppose that this small change produces small perturbations in the primal
and dual solutions, and that the vectors $\Delta s$ and $\Delta x$ have zeros in the same locations as $s$
and $x$, respectively. Since $x$ and $s$ are complementary (see (13.4e)) it follows that

$$0 = x^T s = x^T \Delta s = (\Delta x)^T s = (\Delta x)^T \Delta s.$$

We have from Theorem 13.1 that the optimal objectives of the primal and dual problems
are equal, for both the original and perturbed problems, so

$$c^T x = b^T \lambda, \qquad c^T (x + \Delta x) = (b + \Delta b)^T (\lambda + \Delta\lambda).$$

Moreover, by feasibility of the perturbed solutions in the perturbed problems, we have

$$A (x + \Delta x) = b + \Delta b, \qquad A^T \Delta\lambda = -\Delta s.$$

Hence, the change in optimal objective due to the perturbation is as follows:

$$
\begin{aligned}
c^T \Delta x &= (b + \Delta b)^T (\lambda + \Delta\lambda) - b^T \lambda \\
&= (b + \Delta b)^T \Delta\lambda + (\Delta b)^T \lambda \\
&= (x + \Delta x)^T A^T \Delta\lambda + (\Delta b)^T \lambda \\
&= -(x + \Delta x)^T \Delta s + (\Delta b)^T \lambda \\
&= (\Delta b)^T \lambda.
\end{aligned}
$$

In particular, if $\Delta b = \epsilon e_j$, where $e_j$ is the $j$th unit vector in $\mathbb{R}^m$, we have for all $\epsilon$ sufficiently
small that

$$c^T \Delta x = \epsilon \lambda_j. \tag{13.11}$$

That is, the change in optimal objective is $\lambda_j$ times the size of the perturbation to $b_j$, if the
perturbation is small.
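
The relation (13.11) can be checked numerically. The sketch below is illustrative only: it assumes a 2 x 2 optimal basis matrix `B` with basic costs `cB` and right-hand side `b` (assumed data, not from the text), computes $\lambda$ from $B^T \lambda = c_B$, and compares the change in optimal value under a perturbation $\epsilon e_j$ with $\epsilon \lambda_j$. It relies on the fact that the reduced costs do not depend on $b$, so the basis remains optimal as long as the perturbed basic variables stay nonnegative. The helper `solve` is a hypothetical exact Gauss-Jordan routine written for this example.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve the square system M y = rhs exactly (assumes M is nonsingular)."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Assumed data: an optimal basis matrix, the costs of its basic variables, and b.
B = [[F(1), F(1)], [F(1, 2), F(2)]]
cB = [F(-2), F(-3)]
b = [F(5), F(8)]

Bt = [list(col) for col in zip(*B)]
lam = solve(Bt, cB)                      # multipliers from B^T lam = c_B

def optimal_value(rhs):
    xB = solve(B, rhs)                   # x_B = B^{-1} rhs
    assert min(xB) >= 0, "perturbation too large: basis is no longer feasible"
    return sum(ci * xi for ci, xi in zip(cB, xB))

eps = F(1, 10)
changes = []
for j in range(2):
    b_pert = [bi + (eps if i == j else 0) for i, bi in enumerate(b)]
    changes.append(optimal_value(b_pert) - optimal_value(b))

print(lam, changes)                      # changes[j] equals eps * lam[j]
```

If `eps` is made large enough that a basic variable would become negative, the assertion fires; this marks the point where the informal argument above stops applying.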


13.2 GEOMETRY OF THE FEASIBLE SET

BASES  AND  BASIC  FEASIBLE  POINTS 

We  assume  for  the  remainder  of  the  chapter  that 

The  matrix  A  in  (13.1)  has  full  row  rank.  (13.12) 

In  practice,  a  preprocessing  phase  is  applied  to  the  user-supplied  data  to  remove  some 
redundancies  from  the  given  constraints  and  eliminate  some  of  the  variables.  Reformulation 
by adding slack, surplus, and artificial variables can also result in $A$ satisfying the property
(13.12).

Each iterate generated by the simplex method is a basic feasible point of (13.1). A
vector $x$ is a basic feasible point if it is feasible and if there exists a subset $\mathcal{B}$ of the index set
$\{1, 2, \ldots, n\}$ such that

• $\mathcal{B}$ contains exactly $m$ indices;

• $i \notin \mathcal{B} \Rightarrow x_i = 0$ (that is, the bound $x_i \ge 0$ can be inactive only if $i \in \mathcal{B}$);

• The $m \times m$ matrix $B$ defined by

$$B = [A_i]_{i \in \mathcal{B}} \tag{13.13}$$

is nonsingular, where $A_i$ is the $i$th column of $A$.

A set $\mathcal{B}$ satisfying these properties is called a basis for the problem (13.1). The corresponding
matrix $B$ is called the basis matrix.
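
To make the definition concrete, the following sketch lists every basic feasible point of a tiny problem by brute force: it tries each subset of $m$ column indices, skips subsets whose basis matrix is singular, and keeps those for which the resulting point is nonnegative. The data `A`, `b` and the helpers `solve` and `basic_feasible_points` are assumptions introduced for illustration. Enumeration of this kind is practical only for very small problems and is not how the simplex method proceeds.

```python
from fractions import Fraction as F
from itertools import combinations

def solve(M, rhs):
    """Solve M y = rhs exactly; return None if M is singular."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if piv is None:
            return None
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def basic_feasible_points(A, b):
    """Map each basis (1-based index tuple) to its basic feasible point."""
    m, n = len(A), len(A[0])
    points = {}
    for basis in combinations(range(n), m):
        Bmat = [[A[i][j] for j in basis] for i in range(m)]
        xB = solve(Bmat, b)
        if xB is None or min(xB) < 0:
            continue                     # singular basis matrix or infeasible point
        x = [F(0)] * n                   # nonbasic components stay at zero
        for j, v in zip(basis, xB):
            x[j] = v
        points[tuple(j + 1 for j in basis)] = x
    return points

# Assumed data for the constraints Ax = b, x >= 0 (m = 2, n = 4).
A = [[F(1), F(1), F(1), F(0)],
     [F(2), F(1, 2), F(0), F(1)]]
b = [F(5), F(8)]

bfp = basic_feasible_points(A, b)
for basis, x in bfp.items():
    print(basis, [str(v) for v in x])
```

Of the six index pairs, four give basic feasible points for this data; the other two produce a negative component and are discarded.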

The  simplex  method’s  strategy  of  examining  only  basic  feasible  points  will  converge 
to  a  solution  of  (13.1)  only  if 

(a)  the  problem  has  basic  feasible  points;  and 

(b) at least one such point is a basic optimal point, that is, a solution of (13.1) that is also a
basic  feasible  point. 

Happily,  both  (a)  and  (b)  are  true  under  reasonable  assumptions,  as  the  following  result 
(sometimes  known  as  the  fundamental  theorem  of  linear  programming)  shows. 

Theorem  13.2. 

(i)  If  (13.1)  has  a  nonempty  feasible  region,  then  there  is  at  least  one  basic  feasible  point; 

(ii)  If  (13.1)  has  solutions,  then  at  least  one  such  solution  is  a  basic  optimal  point. 

(iii)  If  (13.1)  is  feasible  and  bounded,  then  it  has  an  optimal  solution. 

Proof. Among all feasible vectors $x$, choose one with the minimal number of nonzero
components, and denote this number by $p$. Without loss of generality, assume that the
nonzeros are $x_1, x_2, \ldots, x_p$, so we have

$$\sum_{i=1}^{p} A_i x_i = b.$$

Suppose first that the columns $A_1, A_2, \ldots, A_p$ are linearly dependent. Then we can
express one of them ($A_p$, say) in terms of the others, and write

$$A_p = \sum_{i=1}^{p-1} A_i z_i, \tag{13.14}$$

for some scalars $z_1, z_2, \ldots, z_{p-1}$. It is easy to check that the vector

$$x(\epsilon) = x + \epsilon (z_1, z_2, \ldots, z_{p-1}, -1, 0, 0, \ldots, 0)^T = x + \epsilon z \tag{13.15}$$

satisfies $A x(\epsilon) = b$ for any scalar $\epsilon$. In addition, since $x_i > 0$ for $i = 1, 2, \ldots, p$, we also
have $x_i(\epsilon) > 0$ for the same indices $i = 1, 2, \ldots, p$ and all $\epsilon$ sufficiently small in magnitude.
However, there is a value $\bar\epsilon \in (0, x_p]$ such that $x_i(\bar\epsilon) = 0$ for some $i = 1, 2, \ldots, p$. Hence,
$x(\bar\epsilon)$ is feasible and has at most $p - 1$ nonzero components, contradicting our choice of $p$
as the minimal number of nonzeros.

Therefore, columns $A_1, A_2, \ldots, A_p$ must be linearly independent, and so $p \le m$. If
$p = m$, we are done, since then $x$ is a basic feasible point and $\mathcal{B}$ is simply $\{1, 2, \ldots, m\}$.
Otherwise $p < m$ and, because $A$ has full row rank, we can choose $m - p$ columns from
among $A_{p+1}, A_{p+2}, \ldots, A_n$ to build up a set of $m$ linearly independent vectors. We construct
$\mathcal{B}$ by adding the corresponding indices to $\{1, 2, \ldots, p\}$. The proof of (i) is complete.

The proof of (ii) is quite similar. Let $x^*$ be a solution with a minimal number of
nonzero components $p$, and assume again that $x_1^*, x_2^*, \ldots, x_p^*$ are the nonzeros. If the
columns $A_1, A_2, \ldots, A_p$ are linearly dependent, we define

$$x^*(\epsilon) = x^* + \epsilon z,$$

where $z$ is chosen exactly as in (13.14), (13.15). It is easy to check that $x^*(\epsilon)$ will be feasible
for all $\epsilon$ sufficiently small, both positive and negative. Hence, since $x^*$ is optimal, we must
have

$$c^T (x^* + \epsilon z) \ge c^T x^* \;\Rightarrow\; \epsilon c^T z \ge 0$$

for all $\epsilon$ sufficiently small (positive and negative). Therefore, $c^T z = 0$ and so $c^T x^*(\epsilon) = c^T x^*$
for all $\epsilon$. The same logic as in the proof of (i) can be applied to find $\bar\epsilon > 0$ such that $x^*(\bar\epsilon)$ is
feasible and optimal, with at most $p - 1$ nonzero components. This contradicts our choice
of $p$ as the minimal number of nonzeros, so the columns $A_1, A_2, \ldots, A_p$ must be linearly
independent. We can now apply the same reasoning as above to conclude that $x^*$ is already
a basic feasible point and therefore a basic optimal point.

The final statement (iii) is a consequence of finite termination of the simplex method.
We comment on the latter property in the next section. □

The terminology we use here is not quite standard, as the following table shows:

our terminology          terminology used elsewhere

basic feasible point     basic feasible solution

basic optimal point      optimal basic feasible solution

The standard terms arose because “solution” and “feasible solution” were originally used
as synonyms for “feasible point.” However, as the discipline of optimization developed,
the word “solution” took on a more specific and intuitive meaning (as in “solution to the
problem”). We maintain consistency with the rest of the book by following this more modern
usage.

Figure 13.2  Vertices of a three-dimensional polytope (indicated by *).


VERTICES  OF  THE  FEASIBLE  POLYTOPE 

The  feasible  set  defined  by  the  linear  constraints  is  a  polytope,  and  the  vertices  of  this 
polytope  are  the  points  that  do  not  lie  on  a  straight  line  between  two  other  points  in  the  set. 
Geometrically,  they  are  easily  recognizable;  see  Figure  13.2.  Algebraically,  the  vertices  are 
exactly  the  basic  feasible  points  defined  above.  We  therefore  have  an  important  relationship 
between  the  algebraic  and  geometric  viewpoints  and  a  useful  aid  to  understanding  how  the 
simplex  method  works. 

Theorem 13.3.

All basic feasible points for (13.1) are vertices of the feasible polytope $\{x \mid Ax = b, \; x \ge 0\}$,
and vice versa.

PROOF. Let $x$ be a basic feasible point and assume without loss of generality that $\mathcal{B} =
\{1, 2, \ldots, m\}$. The matrix $B = [A_i]_{i=1,2,\ldots,m}$ is therefore nonsingular, and

$$x_{m+1} = x_{m+2} = \cdots = x_n = 0. \tag{13.16}$$

Suppose that $x$ lies on a straight line between two other feasible points $y$ and $z$. Then we
can find $\alpha \in (0, 1)$ such that $x = \alpha y + (1 - \alpha) z$. Because of (13.16), the fact that $\alpha$ and
$1 - \alpha$ are both positive, and the nonnegativity of $y$ and $z$, we must have $y_i = z_i = 0$ for
$i = m + 1, m + 2, \ldots, n$. Writing $x_B = (x_1, x_2, \ldots, x_m)^T$ and defining $y_B$ and $z_B$ likewise,
we have from $Ax = Ay = Az = b$ that

$$B x_B = B y_B = B z_B = b,$$

and so, by nonsingularity of $B$, we have $x_B = y_B = z_B$. Therefore, $x = y = z$, contradicting
our assertion that $y$ and $z$ are two feasible points other than $x$. Therefore, $x$ is a vertex.

Conversely, let $x$ be a vertex of the feasible polytope, and suppose that the nonzero
components of $x$ are $x_1, x_2, \ldots, x_p$. If the corresponding columns $A_1, A_2, \ldots, A_p$ are
linearly dependent, then we can construct the vector $x(\epsilon) = x + \epsilon z$ as in (13.15). Since
$x(\epsilon)$ is feasible for all $\epsilon$ with sufficiently small magnitude, we can define $\hat\epsilon > 0$ such that
$x(\hat\epsilon)$ and $x(-\hat\epsilon)$ are both feasible. Since $x = x(0)$ obviously lies on a straight line between
these two points, it cannot be a vertex. Hence our assertion that $A_1, A_2, \ldots, A_p$ are linearly
dependent must be incorrect, so these columns must be linearly independent and $p \le m$.
If $p < m$, and since $A$ has full row rank, we can add $m - p$ indices to $\{1, 2, \ldots, p\}$ to
form a basis $\mathcal{B}$, for which $x$ is the corresponding basic feasible point. This completes our
proof. □

We  conclude  this  discussion  of  the  geometry  of  the  feasible  set  with  a  definition  of 
degeneracy.  This  term  has  a  variety  of  meanings  in  optimization,  as  we  discuss  in  Chapter  16. 
For  the  purposes  of  this  chapter,  we  use  the  following  definition. 

Definition 13.1 (Degeneracy).

A basis $\mathcal{B}$ is said to be degenerate if $x_i = 0$ for some $i \in \mathcal{B}$, where $x$ is the basic feasible
solution corresponding to $\mathcal{B}$. A linear program (13.1) is said to be degenerate if it has at least
one degenerate basis.


13.3  THE  SIMPLEX  METHOD 

OUTLINE 

In  this  section  we  give  a  detailed  description  of  the  simplex  method  for  (13.1).  There 
are actually a number of variants of the simplex method; the one described here is sometimes
known  as  the  revised  simplex  method.  (We  will  describe  an  alternative  known  as  the  dual 
simplex  method  in  Section  13.6.) 

As we described above, all iterates of the simplex method are basic feasible points for
(13.1) and therefore vertices of the feasible polytope. Most steps consist of a move from one
vertex to an adjacent one for which the basis $\mathcal{B}$ differs in exactly one component. On most
steps (but not all), the value of the primal objective function $c^T x$ is decreased. Another type
of step occurs when the problem is unbounded: The step is an edge along which the objective
function is reduced, and along which we can move infinitely far without ever reaching a
vertex.

The major issue at each simplex iteration is to decide which index to remove from the
basis $\mathcal{B}$. Unless the step is a direction of unboundedness, a single index must be removed
from $\mathcal{B}$ and replaced by another from outside $\mathcal{B}$. We can gain some insight into how this
decision is made by looking again at the KKT conditions (13.4).

From $\mathcal{B}$ and (13.4), we can derive values for not just the primal variable $x$ but also
the dual variables $(\lambda, s)$, as we now show. First, define the nonbasic index set $\mathcal{N}$ as the
complement of $\mathcal{B}$, that is,

$$\mathcal{N} = \{1, 2, \ldots, n\} \setminus \mathcal{B}. \tag{13.17}$$

Just as $B$ is the basic matrix, whose columns are $A_i$ for $i \in \mathcal{B}$, we use $N$ to denote the nonbasic
matrix $N = [A_i]_{i \in \mathcal{N}}$. We also partition the $n$-element vectors $x$, $s$, and $c$ according to the
index sets $\mathcal{B}$ and $\mathcal{N}$, using the notation

$$
\begin{aligned}
x_B &= [x_i]_{i \in \mathcal{B}}, & x_N &= [x_i]_{i \in \mathcal{N}}, \\
s_B &= [s_i]_{i \in \mathcal{B}}, & s_N &= [s_i]_{i \in \mathcal{N}}, \\
c_B &= [c_i]_{i \in \mathcal{B}}, & c_N &= [c_i]_{i \in \mathcal{N}}.
\end{aligned}
$$

From the KKT condition (13.4b), we have that

$$A x = B x_B + N x_N = b.$$

The primal variable $x$ for this simplex iterate is defined as

$$x_B = B^{-1} b, \qquad x_N = 0. \tag{13.18}$$

Since we are dealing only with basic feasible points, we know that $B$ is nonsingular and
that $x_B \ge 0$, so this choice of $x$ satisfies two of the KKT conditions: the equality constraints
(13.4b) and the nonnegativity condition (13.4c).

We choose $s$ to satisfy the complementarity condition (13.4e) by setting $s_B = 0$. The
remaining components $\lambda$ and $s_N$ can be found by partitioning the condition $A^T \lambda + s = c$ into
$c_B$ and $c_N$ components and using $s_B = 0$ to obtain

$$B^T \lambda = c_B, \qquad N^T \lambda + s_N = c_N. \tag{13.19}$$

Since $B$ is square and nonsingular, the first equation uniquely defines $\lambda$ as

$$\lambda = B^{-T} c_B. \tag{13.20}$$

The second equation in (13.19) implies a value for $s_N$:

$$s_N = c_N - N^T \lambda = c_N - (B^{-1} N)^T c_B. \tag{13.21}$$

Computation of the vector $s_N$ is often referred to as pricing. The components of $s_N$ are often
called the reduced costs of the nonbasic variables $x_N$.
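
The formulas (13.18), (13.20), and (13.21) translate directly into a short routine. The sketch below is a minimal illustration under assumed data `A`, `b`, `c`; the function name `price`, the 0-based index convention, and the dense exact solver `solve` are choices made for this example only. A practical code would use a factorization of $B$ rather than solving each system from scratch.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve the square system M y = rhs exactly (assumes M is nonsingular)."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def price(A, b, c, basis):
    """Return (x_B, lam, s_N, nonbasic) for an ordered list of 0-based basic indices."""
    m, n = len(A), len(A[0])
    nonbasic = [j for j in range(n) if j not in basis]
    Bmat = [[A[i][j] for j in basis] for i in range(m)]
    Bt = [list(col) for col in zip(*Bmat)]
    xB = solve(Bmat, b)                                   # x_B = B^{-1} b
    lam = solve(Bt, [c[j] for j in basis])                # B^T lam = c_B
    sN = [c[j] - sum(A[i][j] * lam[i] for i in range(m))  # s_N = c_N - N^T lam
          for j in nonbasic]
    return xB, lam, sN, nonbasic

# Assumed data (m = 2, n = 4).
A = [[F(1), F(1), F(1), F(0)],
     [F(2), F(1, 2), F(0), F(1)]]
b = [F(5), F(8)]
c = [F(-3), F(-2), F(0), F(0)]

xB, lam, sN, nonbasic = price(A, b, c, [2, 0])   # basis {3, 1} in 1-based terms
print(xB, lam, dict(zip(nonbasic, sN)))
```

For this basis one reduced cost is negative, so the point is not optimal and the corresponding nonbasic index is a candidate to enter the basis.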

The only KKT condition that we have not enforced explicitly is the nonnegativity
condition $s \ge 0$. The basic components $s_B$ certainly satisfy this condition, by our choice
$s_B = 0$. If the vector $s_N$ defined by (13.21) also satisfies $s_N \ge 0$, we have found an optimal
vector triple $(x, \lambda, s)$, so the algorithm can terminate and declare success. Usually, however,
one or more of the components of $s_N$ are negative. The new index to enter the basis $\mathcal{B}$ (the
entering index) is chosen to be one of the indices $q \in \mathcal{N}$ for which $s_q < 0$. As we show below,
the objective $c^T x$ will decrease when we allow $x_q$ to become positive if and only if (i) $s_q < 0$
and (ii) it is possible to increase $x_q$ away from zero while maintaining feasibility of $x$. Our
procedure for altering $\mathcal{B}$ and changing $x$ and $s$ can be described accordingly as follows:

•  allow  xq  to  increase  from  zero  during  the  next  step; 

•  fix  all  other  components  of  xN  at  zero,  and  figure  out  the  effect  of  increasing  xq  on  the 
current  basic  vector  xB,  given  that  we  want  to  stay  feasible  with  respect  to  the  equality 
constraints  Ax  —  b; 

• keep increasing $x_q$ until one of the components of $x_B$ ($x_p$, say) is driven to zero, or
determine that no such component exists (the unbounded case);

•  remove  index  p  (known  as  the  leaving  index)  from  B  and  replace  it  with  the  entering 
index  q. 

This  process  of  selecting  entering  and  leaving  indices,  and  performing  the  algebraic 
operations  necessary  to  keep  track  of  the  values  of  the  variables  x,  X,  and  s,  is  sometimes 
known  as  pivoting. 

We now formalize the pivoting procedure in algebraic terms. Since both the new
iterate $x^+$ and the current iterate $x$ should satisfy $Ax = b$, and since $x_N = 0$ and $x_i^+ = 0$ for
$i \in \mathcal{N} \setminus \{q\}$, we have

$$A x^+ = B x_B^+ + A_q x_q^+ = B x_B = A x.$$

By multiplying this expression by $B^{-1}$ and rearranging, we obtain

$$x_B^+ = x_B - B^{-1} A_q x_q^+. \tag{13.22}$$

Geometrically speaking, (13.22) is usually a move along an edge of the feasible polytope that
decreases $c^T x$. We continue to move along this edge until a new vertex is encountered. At
this vertex, a new constraint $x_p \ge 0$ must have become active, that is, one of the components
$x_p$, $p \in \mathcal{B}$, has decreased to zero. We then remove this index $p$ from the basis $\mathcal{B}$ and replace
it by $q$.

We now show how the step defined by (13.22) affects the value of $c^T x$. From (13.22),
we have

$$c^T x^+ = c_B^T x_B^+ + c_q x_q^+ = c_B^T x_B - c_B^T B^{-1} A_q x_q^+ + c_q x_q^+. \tag{13.23}$$

From (13.20) we have $c_B^T B^{-1} = \lambda^T$, while from the second equation in (13.19), since $q \in \mathcal{N}$,
we have $A_q^T \lambda = c_q - s_q$. Therefore,

$$c_B^T B^{-1} A_q x_q^+ = \lambda^T A_q x_q^+ = (c_q - s_q) x_q^+,$$

so by substituting in (13.23) we obtain

$$c^T x^+ = c_B^T x_B - (c_q - s_q) x_q^+ + c_q x_q^+ = c^T x + s_q x_q^+. \tag{13.24}$$

Since $q$ was chosen to have $s_q < 0$, it follows that the step (13.22) produces a decrease in
the primal objective function $c^T x$ whenever $x_q^+ > 0$.

It is possible that we can increase $x_q^+$ to $\infty$ without ever encountering a new vertex.
In other words, the constraint $x_B^+ = x_B - B^{-1} A_q x_q^+ \ge 0$ holds for all positive values of $x_q^+$.
When this happens, the linear program is unbounded; the simplex method has identified a
ray that lies entirely within the feasible polytope along which the objective $c^T x$ decreases
to $-\infty$.
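
The unbounded case is easy to see on a one-constraint problem. The following sketch uses a hypothetical instance (not from the text) in which the vector $d = B^{-1} A_q$ has no positive component, so increasing $x_q$ never drives a basic variable to zero. It walks along the resulting ray and reports feasibility and the objective value, which by (13.24) changes by $s_q x_q^+$.

```python
from fractions import Fraction as F

# Hypothetical instance: min -x1 - x2  subject to  x1 - x2 + x3 = 1,  x >= 0.
A = [[F(1), F(-1), F(1)]]
b = [F(1)]
c = [F(-1), F(-1), F(0)]

basis, q = [2], 1                # 0-based: x3 is basic, x2 is the entering variable
Binv = 1 / A[0][basis[0]]        # the basis matrix is 1 x 1 here
xB = [Binv * b[0]]               # x_B = B^{-1} b
lam = c[basis[0]] * Binv         # solves B^T lam = c_B
sq = c[q] - A[0][q] * lam        # reduced cost of x_q
d = [Binv * A[0][q]]             # solves B d = A_q

def point_on_ray(t):
    """Point with x_q = t and x_B = x_B - d t; other components stay at zero."""
    x = [F(0)] * 3
    x[q] = F(t)
    x[basis[0]] = xB[0] - d[0] * t
    return x

print("s_q =", sq, " d =", d)
for t in (0, 1, 10, 1000):
    x = point_on_ray(t)
    feasible = sum(a * xi for a, xi in zip(A[0], x)) == b[0] and min(x) >= 0
    print(t, feasible, sum(ci * xi for ci, xi in zip(c, x)))
```

Every point on the ray is feasible and the objective falls linearly in the step length, which is the situation described above.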

Figure 13.3 shows a path traversed by the simplex method for a problem in $\mathbb{R}^2$. In this
example, the optimal vertex $x^*$ is found in three steps.

Figure 13.3  Simplex iterates for a two-dimensional problem.

If the basis $\mathcal{B}$ is nondegenerate (see Definition 13.1), then we are guaranteed that
$x_q^+ > 0$, so we can be assured of a strict decrease in the objective function $c^T x$ at this step. If
the problem (13.1) is nondegenerate, we can ensure a decrease in $c^T x$ at every step, and can
therefore prove the following result concerning termination of the simplex method.

Theorem  13.4. 

Provided  that  the  linear  program  (13.1)  is  nondegenerate  and  bounded,  the  simplex 
method  terminates  at  a  basic  optimal  point. 

PROOF.  The  simplex  method  cannot  visit  the  same  basic  feasible  point  x  at  two  different 
iterations,  because  it  attains  a  strict  decrease  at  each  iteration.  Since  the  number  of  possible 
bases $\mathcal{B}$ is finite (there are only a finite number of ways to choose a subset of $m$ indices
from $\{1, 2, \ldots, n\}$), and since each basis defines a single basic feasible point, there are only
a  finite  number  of  basic  feasible  points.  Hence,  the  number  of  iterations  is  finite.  Moreover, 
since  the  method  is  always  able  to  take  a  step  away  from  a  nonoptimal  basic  feasible  point, 
and  since  the  problem  is  not  unbounded,  the  method  must  terminate  at  a  basic  optimal 
point.  □ 

This  result  gives  us  a  proof  of  Theorem  13.2  (iii)  in  the  case  in  which  the  linear 
program  is  nondegenerate.  The  proof  of  finite  termination  is  considerably  more  complex 
when  nondegeneracy  of  (13.1)  is  not  assumed,  as  we  discuss  at  the  end  of  Section  13.5. 


A  SINGLE  STEP  OF  THE  METHOD 

We  have  covered  most  of  the  mechanics  of  taking  a  single  step  of  the  simplex  method. 
To  make  subsequent  discussions  easier  to  follow,  we  summarize  our  description. 

Procedure 13.1 (One Step of Simplex).

Given $\mathcal{B}$, $\mathcal{N}$, $x_B = B^{-1} b \ge 0$, $x_N = 0$;
Solve $B^T \lambda = c_B$ for $\lambda$,
Compute $s_N = c_N - N^T \lambda$; (* pricing *)
if $s_N \ge 0$
    stop; (* optimal point found *)
Select $q \in \mathcal{N}$ with $s_q < 0$ as the entering index;
Solve $B d = A_q$ for $d$;
if $d \le 0$
    stop; (* problem is unbounded *)
Calculate $x_q^+ = \min_{i \mid d_i > 0} (x_B)_i / d_i$, and use $p$ to denote the minimizing $i$;
Update $x_B^+ = x_B - d x_q^+$, $x_N^+ = (0, \ldots, 0, x_q^+, 0, \ldots, 0)^T$;
Change $\mathcal{B}$ by adding $q$ and removing the basic variable corresponding to column $p$ of $B$.


We illustrate this procedure with a simple example.

□ Example 13.1

Consider the problem

$$
\begin{aligned}
\min \; -3x_1 - 2x_2 \quad \text{subject to} \quad
x_1 + x_2 + x_3 &= 5, \\
2x_1 + \tfrac{1}{2} x_2 + x_4 &= 8, \\
x &\ge 0.
\end{aligned}
$$

Suppose we start with the basis $\mathcal{B} = \{3, 4\}$, for which we have

$$x_B = \begin{pmatrix} x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 5 \\ 8 \end{pmatrix}, \quad
\lambda = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
s_N = \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} = \begin{pmatrix} -3 \\ -2 \end{pmatrix},$$

and an objective value of $c^T x = 0$. Since both elements of $s_N$ are negative, we could choose
either 1 or 2 to be the entering variable. Suppose we choose $q = 1$. We obtain $d = (1, 2)^T$,
so we cannot (yet) conclude that the problem is unbounded. By performing the ratio
calculation, we find that $p = 2$ (corresponding to the index 4) and $x_1^+ = 4$. We update the
basic and nonbasic index sets to $\mathcal{B} = \{3, 1\}$ and $\mathcal{N} = \{4, 2\}$, and move to the next iteration.

At the second iteration, we have

$$x_B = \begin{pmatrix} x_3 \\ x_1 \end{pmatrix} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}, \quad
\lambda = \begin{pmatrix} 0 \\ -3/2 \end{pmatrix}, \quad
s_N = \begin{pmatrix} s_4 \\ s_2 \end{pmatrix} = \begin{pmatrix} 3/2 \\ -5/4 \end{pmatrix},$$

with an objective value of $-12$. We see that $s_N$ has one negative component, corresponding
to the index $q = 2$, so we select this index to enter the basis. We obtain $d = (3/4, 1/4)^T$,
so again we do not detect unboundedness. Continuing, we find that the maximum value of
$x_2^+$ is $4/3$, and that $p = 1$, which indicates that index 3 will leave the basis $\mathcal{B}$. We update the
index sets to $\mathcal{B} = \{2, 1\}$ and $\mathcal{N} = \{4, 3\}$ and continue.

At the start of the third iteration, we have

$$x_B = \begin{pmatrix} x_2 \\ x_1 \end{pmatrix} = \begin{pmatrix} 4/3 \\ 11/3 \end{pmatrix}, \quad
\lambda = \begin{pmatrix} -5/3 \\ -2/3 \end{pmatrix}, \quad
s_N = \begin{pmatrix} s_4 \\ s_3 \end{pmatrix} = \begin{pmatrix} 2/3 \\ 5/3 \end{pmatrix},$$

with an objective value of $c^T x = -41/3$. We see that $s_N \ge 0$, so the optimality test is
satisfied, and we terminate.

□
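
The iterations of Example 13.1 can be reproduced with a direct transcription of Procedure 13.1. The sketch below is illustrative rather than efficient: it takes the objective as $-3x_1 - 2x_2$ (the choice consistent with the objective values $-12$ and $-41/3$ quoted in the example), always picks the most negative reduced cost as the entering index (one possible rule, assumed here), and re-solves the linear systems from scratch at every iteration instead of updating $x_B$ or a factorization. The names `simplex` and `solve` are hypothetical helpers.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve the square system M y = rhs exactly (assumes M is nonsingular)."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def simplex(A, b, c, basis):
    """Repeat Procedure 13.1 from a feasible basis (ordered list of 0-based indices)."""
    m, n = len(A), len(A[0])
    basis = list(basis)
    history = []                              # (1-based basis, x_B, objective) per iterate
    while True:
        nonbasic = [j for j in range(n) if j not in basis]
        Bmat = [[A[i][j] for j in basis] for i in range(m)]
        xB = solve(Bmat, b)
        lam = solve([list(col) for col in zip(*Bmat)], [c[j] for j in basis])
        sN = [c[j] - sum(A[i][j] * lam[i] for i in range(m)) for j in nonbasic]  # pricing
        obj = sum(c[j] * v for j, v in zip(basis, xB))
        history.append(([j + 1 for j in basis], xB, obj))
        if min(sN) >= 0:
            return "optimal", history
        q = nonbasic[sN.index(min(sN))]       # entering index: most negative s_q
        d = solve(Bmat, [A[i][q] for i in range(m)])
        if max(d) <= 0:
            return "unbounded", history
        # Ratio test: p is the position in the basis of the leaving index.
        _, p = min((xB[i] / d[i], i) for i in range(m) if d[i] > 0)
        basis[p] = q

A = [[F(1), F(1), F(1), F(0)],
     [F(2), F(1, 2), F(0), F(1)]]
b = [F(5), F(8)]
c = [F(-3), F(-2), F(0), F(0)]

status, history = simplex(A, b, c, [2, 3])    # start from the basis {3, 4}
print(status)
for basis, xB, obj in history:
    print(basis, [str(v) for v in xB], obj)
```

The run visits the bases {3, 4}, {3, 1}, and {2, 1} in that order. On degenerate problems this naive loop could cycle, since no anti-cycling safeguard is included.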


We need to flesh out Procedure 13.1 with specifics of three important aspects of the
implementation:

• Linear algebra issues: maintaining an LU factorization of $B$ that can be used to solve
for $\lambda$ and $d$.

• Selection of the entering index $q$ from among the negative components of $s_N$. (In
general, there are many such components.)

• Handling of degenerate bases and degenerate steps, in which it is not possible to
choose a positive value of $x_q$ without violating feasibility.

Proper  handling  of  these  issues  is  crucial  to  the  efficiency  of  a  simplex  implementation.  We 
give  some  details  in  the  next  three  sections. 


13.4 LINEAR ALGEBRA IN THE SIMPLEX METHOD


We have to solve two linear systems involving the matrix $B$ at each step; namely,

$$B^T \lambda = c_B, \qquad B d = A_q. \tag{13.25}$$

We never calculate the inverse basis matrix $B^{-1}$ explicitly just to solve these systems. Instead,
we calculate or maintain some factorization of $B$ (usually an LU factorization) and use
triangular substitutions with the factors to recover $\lambda$ and $d$. It is less expensive to update the
factorization than to calculate it afresh at each iteration because the basis matrix $B$ changes
by just a single column between iterations.

The  standard  factorization/updating  procedures  start  with  an  LU  factorization  of  B 
at  the  first  iteration  of  the  simplex  algorithm.  Since  in  practical  applications  B  is  large 
and  sparse,  its  rows  and  columns  are  rearranged  during  the  factorization  to  maintain  both 
numerical  stability  and  sparsity  of  the  L  and  U  factors.  One  successful  pivot  strategy  that 
trades  off  between  these  two  aims  was  proposed  by  Markowitz  in  1957  [202];  it  is  still  used 
as  the  basis  of  many  practical  sparse  LU  algorithms.  Other  considerations  may  also  enter 
into our choice of row and column reordering of $B$. For example, it may help to improve the
efficiency of the updating procedure if as many as possible of the leading columns of $U$
contain just a single nonzero, on the diagonal. Many heuristics have been devised for choosing
row  and  column  permutations  that  produce  this  and  other  desirable  structural  features. 

Let us assume for simplicity that row and column permutations are already
incorporated in $B$, so that we write the initial LU factorization as

$$L U = B, \tag{13.26}$$

($L$ is unit lower triangular, $U$ is upper triangular). The system $B d = A_q$ can then be solved
by the following two-step procedure:

$$L \bar{d} = A_q, \qquad U d = \bar{d}. \tag{13.27}$$

Similarly, the system $B^T \lambda = c_B$ is solved by performing the following two triangular
substitutions:

$$U^T \bar\lambda = c_B, \qquad L^T \lambda = \bar\lambda.$$

Figure 13.4  Left: $L^{-1} B^+$, which is upper triangular except for the column occupied
by $A_p$. Right: After cyclic row and column permutation $P_1$, the non-upper triangular
part of $P_1 L^{-1} B^+ P_1^T$ appears in the last row.
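
The two pairs of triangular solves can be sketched in a few lines. The example below assumes a small dense basis matrix, cost vector, and entering column (all made up for illustration), and factors $B$ without any row or column permutations, which works only when no zero pivot arises. Production codes use sparse factorizations with permutations chosen for stability and sparsity. The helper names `lu`, `forward`, `backward`, and `transpose` are hypothetical.

```python
from fractions import Fraction as F

def lu(Bm):
    """Doolittle LU without pivoting: unit lower L and upper U with L U = Bm."""
    n = len(Bm)
    L = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    U = [list(row) for row in Bm]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]       # assumes a nonzero pivot U[k][k]
            U[i] = [a - L[i][k] * p for a, p in zip(U[i], U[k])]
    return L, U

def forward(Lo, rhs):
    """Solve Lo y = rhs for a lower triangular matrix Lo."""
    y = []
    for i, row in enumerate(Lo):
        y.append((rhs[i] - sum(row[j] * y[j] for j in range(i))) / row[i])
    return y

def backward(Up, rhs):
    """Solve Up y = rhs for an upper triangular matrix Up."""
    n = len(Up)
    y = [F(0)] * n
    for i in reversed(range(n)):
        y[i] = (rhs[i] - sum(Up[i][j] * y[j] for j in range(i + 1, n))) / Up[i][i]
    return y

def transpose(M):
    return [list(col) for col in zip(*M)]

# Assumed 3 x 3 basis matrix, basic costs, and entering column.
B = [[F(2), F(1), F(0)],
     [F(4), F(3), F(1)],
     [F(0), F(2), F(5)]]
cB = [F(1), F(0), F(-1)]
Aq = [F(1), F(2), F(3)]

L, U = lu(B)
d = backward(U, forward(L, Aq))                           # L dbar = A_q, then U d = dbar
lam = backward(transpose(L), forward(transpose(U), cB))   # U^T lbar = c_B, then L^T lam = lbar
print(d, lam)
```

Note that $U^T$ is lower triangular and $L^T$ is upper triangular, so the second system uses a forward substitution followed by a backward one, mirroring the order used for $B d = A_q$.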

We now discuss a procedure for updating the factors $L$ and $U$ after one step of the
simplex method, when the index $p$ is removed from the basis $\mathcal{B}$ and replaced by the index
$q$. The corresponding change to the basis matrix $B$ is that the column $B_p$ is removed from
$B$ and replaced by $A_q$. We call the resulting matrix $B^+$ and note that if we rewrite (13.26) as
$U = L^{-1} B$, the modified matrix $L^{-1} B^+$ will be upper triangular except in column $p$. That
is, $L^{-1} B^+$ has the form shown on the left in Figure 13.4.

We now perform a cyclic permutation that moves column $p$ to the last column position
$m$ and moves columns $p + 1, p + 2, \ldots, m$ one position to the left to make room for it. If we
apply the same permutation to rows $p$ through $m$, the net effect is to move the non-upper
triangular part to the last row of the matrix, as shown in Figure 13.4. If we denote the
permutation matrix by $P_1$, the matrix illustrated at right in Figure 13.4 is $P_1 L^{-1} B^+ P_1^T$.

Finally, we perform sparse Gaussian elimination on the matrix $P_1 L^{-1} B^+ P_1^T$ to restore
upper triangular form. That is, we find $L_1$ and $U_1$ (lower and upper triangular, respectively)
such that

$$P_1 L^{-1} B^+ P_1^T = L_1 U_1. \tag{13.28}$$

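It is easy to show that $L_1$ and $U_1$ have a simple form. The lower triangular matrix $L_1$
differs from the identity only in the last row, while $U_1$ is identical to $P_1 L^{-1} B^+ P_1^T$ except
that the $(m, m)$ element is changed and the off-diagonal elements in the last row are
eliminated.

We give details of this process for the case of $m = 5$. Using the notation

$$L^{-1} B = U = \begin{pmatrix}
u_{11} & u_{12} & u_{13} & u_{14} & u_{15} \\
 & u_{22} & u_{23} & u_{24} & u_{25} \\
 & & u_{33} & u_{34} & u_{35} \\
 & & & u_{44} & u_{45} \\
 & & & & u_{55}
\end{pmatrix}, \qquad
L^{-1} A_q = \begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \end{pmatrix},$$

and supposing that $p = 2$ (so that the second column is replaced by $L^{-1} A_q$), we have

$$L^{-1} B^+ = \begin{pmatrix}
u_{11} & w_1 & u_{13} & u_{14} & u_{15} \\
 & w_2 & u_{23} & u_{24} & u_{25} \\
 & w_3 & u_{33} & u_{34} & u_{35} \\
 & w_4 & & u_{44} & u_{45} \\
 & w_5 & & & u_{55}
\end{pmatrix}.$$

After the cyclic permutation $P_1$, we have

$$P_1 L^{-1} B^+ P_1^T = \begin{pmatrix}
u_{11} & u_{13} & u_{14} & u_{15} & w_1 \\
 & u_{33} & u_{34} & u_{35} & w_3 \\
 & & u_{44} & u_{45} & w_4 \\
 & & & u_{55} & w_5 \\
 & u_{23} & u_{24} & u_{25} & w_2
\end{pmatrix}. \tag{13.29}$$

The factors $L_1$ and $U_1$ are now as follows:

$$L_1 = \begin{pmatrix}
1 & & & & \\
 & 1 & & & \\
 & & 1 & & \\
 & & & 1 & \\
 & l_{52} & l_{53} & l_{54} & 1
\end{pmatrix}, \qquad
U_1 = \begin{pmatrix}
u_{11} & u_{13} & u_{14} & u_{15} & w_1 \\
 & u_{33} & u_{34} & u_{35} & w_3 \\
 & & u_{44} & u_{45} & w_4 \\
 & & & u_{55} & w_5 \\
 & & & & \hat{w}_2
\end{pmatrix}, \tag{13.30}$$

for certain values of $l_{52}$, $l_{53}$, $l_{54}$, and $\hat{w}_2$ (see Exercise 13.10).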

The result of this updating process is the factorization (13.28), which we can rewrite
as follows:

$$B^+ = L^+ U^+, \quad \text{where } L^+ = L P_1^T L_1, \quad U^+ = U_1 P_1. \tag{13.31}$$

There is no need to calculate $L^+$ and $U^+$ explicitly. Rather, the nonzero elements in $L_1$ and
the last column of $U_1$, and the permutation information in $P_1$, can be stored in compact
form, so that triangular substitutions involving $L^+$ and $U^+$ can be performed by applying
a number of permutations and sparse triangular substitutions involving these factors. The
factorization updates from subsequent simplex steps are stored and applied in a similar
fashion.
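
The update can be traced on a small dense example. The sketch below is a simplified illustration of the steps just described, not a practical implementation: it works with dense exact matrices, forms $P_1$ explicitly, and multiplies the factors out only to confirm (13.31), whereas a real code would store $L_1$, the last column of $U_1$, and $P_1$ in compact sparse form. The factors `L`, `U`, the replaced column position `p` (0-based), and the entering column `Aq` are assumed data; `forrest_tomlin_update` and the matrix helpers are hypothetical names.

```python
from fractions import Fraction as F

def identity(n):
    return [[F(int(i == j)) for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forrest_tomlin_update(L, U, p, Aq):
    """Return (P1, L1, U1) with P1 L^{-1} B+ P1^T = L1 U1, where B+ is B = L U
    with column p (0-based) replaced by Aq."""
    m = len(U)
    # w = L^{-1} A_q by forward substitution (L is unit lower triangular).
    w = []
    for i in range(m):
        w.append(Aq[i] - sum(L[i][j] * w[j] for j in range(i)))
    # L^{-1} B+ equals U with column p replaced by w.
    H = [row[:p] + [w[i]] + row[p + 1:] for i, row in enumerate(U)]
    # Cyclic permutation: index p moves to the end, p+1..m-1 shift up by one.
    perm = list(range(p)) + list(range(p + 1, m)) + [p]
    P1 = [[F(int(j == perm[i])) for j in range(m)] for i in range(m)]
    U1 = [[H[i][j] for j in perm] for i in perm]          # this is P1 H P1^T
    # Eliminate the off-diagonal entries of the last row; keep multipliers in L1.
    L1 = identity(m)
    for k in range(p, m - 1):
        mult = U1[m - 1][k] / U1[k][k]
        L1[m - 1][k] = mult
        U1[m - 1] = [a - mult * r for a, r in zip(U1[m - 1], U1[k])]
    return P1, L1, U1

# Assumed factors of B = [[2, 1, 0], [4, 3, 1], [0, 2, 5]] and a new first column.
L = [[F(1), F(0), F(0)], [F(2), F(1), F(0)], [F(0), F(2), F(1)]]
U = [[F(2), F(1), F(0)], [F(0), F(1), F(1)], [F(0), F(0), F(3)]]
p, Aq = 0, [F(1), F(1), F(4)]

B = matmul(L, U)
Bplus = [row[:p] + [Aq[i]] + row[p + 1:] for i, row in enumerate(B)]

P1, L1, U1 = forrest_tomlin_update(L, U, p, Aq)
Lplus = matmul(matmul(L, transpose(P1)), L1)      # L+ = L P1^T L1
Uplus = matmul(U1, P1)                            # U+ = U1 P1
print(matmul(Lplus, Uplus) == Bplus)
```

The multipliers stored in the last row of `L1` are the quantities whose size governs the stability of the update; nothing in this sketch guards against them becoming large.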

The procedure we have just outlined is due to Forrest and Tomlin [110]. It is quite
efficient, because it requires the storage of little data at each update and does not require much
movement of data in memory. Its major disadvantage is possible numerical instability. Large
elements in the factors of a matrix are a sure indicator of instability, and the multipliers in
the $L_1$ factor ($l_{52}$ in (13.30), for example) may be very large. An earlier scheme of Bartels and
Golub [12] allowed swapping of rows to avoid these problems. For instance, if $|u_{33}| < |u_{23}|$
in (13.29), we could swap rows 2 and 5 to ensure that the subsequent multiplier $l_{52}$ in the $L_1$
factor does not exceed 1 in magnitude. This improved stability comes at a price: The lower
right corner of the upper triangular factor may become more dense during each update.

Although  the  update  information  for  each  iteration  (the  permutation  matrices  and 
the  sparse  triangular  factors)  can  often  be  stored  in  a  highly  compact  form,  the  total  amount 
of  space  may  build  up  to  unreasonable  levels  after  many  such  updates  have  been  performed. 
As  the  number  of  updates  builds  up,  so  does  the  time  needed  to  solve  for  the  vectors 
$d$ and $\lambda$ in Procedure 13.1. If an unstable updating procedure is used, numerical errors
may  also  come  into  play,  blocking  further  progress  by  the  simplex  algorithm.  For  all  these 
reasons,  most  simplex  implementations  periodically  calculate  a  fresh  LU  factorization  of 
the  current  basis  matrix  B  and  discard  the  accumulated  updates.  The  new  factorization 
uses  the  same  permutation  strategies  that  we  apply  to  the  very  first  factorization,  which 
balance  the  requirements  of  stability,  sparsity,  and  structure. 


13.5 OTHER IMPORTANT DETAILS

PRICING  AND  SELECTION  OF  THE  ENTERING  INDEX 

There are usually many negative components of $s_N$ at each step. How do we choose
one of these to become the index that enters the basis? Ideally, we would like to choose the
sequence of entering indices $q$ that gets us to the solution $x^*$ in the fewest possible steps,
but we rarely have the global perspective needed to implement this strategy. Instead, we use
more shortsighted but practical strategies that obtain a significant decrease in $c^Tx$ on just
the present iteration. There is usually a tradeoff between the effort spent on finding a good
entering index and the amount of decrease in $c^Tx$ resulting from this choice. Different pivot
strategies resolve this tradeoff in different ways.

Dantzig's original selection rule is one of the simplest. It chooses $q$ such that $s_q$ is
the most negative component of $s_N = c_N - N^T\lambda$. This rule, which is motivated by (13.24), gives
the maximum improvement in $c^Tx$ per unit increase in the entering variable $x_q$. A large
reduction in $c^Tx$ is not guaranteed, however. It could be that we can increase $x_q$ only a tiny
amount from zero (or not at all) before reaching the next vertex.

Calculation of the entire vector $s_N$ from (13.21) requires a multiplication by $N^T$, which
can be expensive when the matrix $N$ is very large. Partial pricing strategies calculate only a
subvector of $s_N$ and make the choice of entering variable from among the negative entries in
this subvector. To give all the indices in $\mathcal{N}$ a chance to enter the basis, these strategies cycle
through the nonbasic elements, periodically changing the subvector of $s_N$ they evaluate so
that no nonbasic index is ignored for too long.
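
As a rough illustration of the two rules just described, the following Python sketch prices nonbasic columns with NumPy. The function names `reduced_costs`, `dantzig_pricing`, and `partial_pricing`, the block-cycling scheme with a `start` offset, and the small data set are all assumptions made for this sketch; practical codes reuse a factorization of $B$ instead of solving from scratch.

```python
import numpy as np

def reduced_costs(A, c, basis, cols):
    """Return s_j = c_j - A_j^T lambda for the column indices in `cols`."""
    lam = np.linalg.solve(A[:, basis].T, c[basis])   # B^T lambda = c_B
    return c[cols] - A[:, cols].T @ lam

def dantzig_pricing(A, c, basis):
    """Index of the most negative component of s_N, or None if s_N >= 0."""
    nonbasic = [j for j in range(A.shape[1]) if j not in basis]
    s = reduced_costs(A, c, basis, nonbasic)
    k = int(np.argmin(s))
    return nonbasic[k] if s[k] < 0 else None

def partial_pricing(A, c, basis, start, block):
    """Price one block of nonbasic indices at a time, cycling from `start`.
    Returns (entering index or None, offset to resume from next time)."""
    nonbasic = [j for j in range(A.shape[1]) if j not in basis]
    n = len(nonbasic)
    for shift in range(0, n, block):
        size = min(block, n - shift)
        idx = [nonbasic[(start + shift + t) % n] for t in range(size)]
        s = reduced_costs(A, c, basis, idx)
        k = int(np.argmin(s))
        if s[k] < 0:
            return idx[k], (start + shift + size) % n
    return None, start

# Hypothetical data: min -4*x1 - 2*x2 with two slack columns (indices 2, 3).
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [2.0, 0.5, 0.0, 1.0]])
c = np.array([-4.0, -2.0, 0.0, 0.0])
```

With the slack basis `[2, 3]`, Dantzig's rule selects column 0, whereas partial pricing that happens to start at the other nonbasic column accepts column 1 without ever touching column 0.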

Neither  of  these  strategies  guarantees  that  we  can  make  a  substantial  move  along  the 
chosen edge before reaching a new vertex. Multiple pricing strategies are more thorough: For
a small subset of indices $q \in \mathcal{N}$, they evaluate $s_q$ and, if $s_q < 0$, the maximum value of $x_q^+$
that maintains feasibility of $x^+$ and the consequent change $s_q x_q^+$ in the objective function
(see (13.24)). Calculation of $x_q^+$ requires evaluation of $d = B^{-1}A_q$ as in Procedure 13.1,
which is not cheap. Subsequent iterations deal with this same index subset until we reach
an iteration at which all $s_q$ are nonnegative for $q$ in the subset. At this point, the full vector
$s_N$ is computed, a new subset of nonbasic indices is chosen, and the cycle begins again. This
approach has the advantage that the columns of the matrix $N$ outside the current subset of
priced components need not be accessed at all, so memory access in the implementation is
quite localized.

Naturally,  it  is  possible  to  devise  heuristics  that  combine  partial  and  multiple  pricing 
in  various  imaginative  ways. 

A  sophisticated  rule  known  as  steepest  edge  chooses  the  “most  downhill”  direction 
from among all the candidates: the one that produces the largest decrease in $c^Tx$ per unit
distance moved along the edge. (By contrast, Dantzig's rule maximizes the decrease in $c^Tx$
per unit change in $x_q$, which is not the same thing, as a small change in $x_q$ can correspond
to a large distance moved along the edge.) During the pivoting step, the overall change in $x$ is

\[
x^+ = \begin{bmatrix} x_B^+ \\ x_N^+ \end{bmatrix}
    = \begin{bmatrix} x_B \\ x_N \end{bmatrix}
    + \begin{bmatrix} -B^{-1}A_q \\ e_q \end{bmatrix} x_q^+
    = x + \eta_q x_q^+,
\tag{13.32}
\]

where $e_q$ is the unit vector with a 1 in the position corresponding to the index $q \in \mathcal{N}$ and
zeros elsewhere, and the vector $\eta_q$ is defined as

\[
\eta_q = \begin{bmatrix} -B^{-1}A_q \\ e_q \end{bmatrix}
       = \begin{bmatrix} -d \\ e_q \end{bmatrix};
\tag{13.33}
\]

see (13.25). The change in $c^Tx$ per unit step along $\eta_q$ is given by

\[
\frac{c^T\eta_q}{\|\eta_q\|}.
\tag{13.34}
\]


The steepest-edge rule chooses $q \in \mathcal{N}$ to minimize this quantity.

If we had to compute each $\eta_i$ by solving $Bd = A_i$ for each $i \in \mathcal{N}$, the steepest-edge
strategy would be prohibitively expensive. Goldfarb and Reid [134] showed that the measure
(13.34) of edge steepness for all indices $i \in \mathcal{N}$ can, in fact, be updated quite economically
at each iteration. We outline their steepest-edge procedure by showing how each $c^T\eta_i$ and
$\|\eta_i\|$ can be updated at the current iteration.

First, note that we already know the numerator $c^T\eta_i$ in (13.34) without calculating
$\eta_i$, because by taking the inner product of (13.32) with $c$ and using (13.24), we have that
$c^T\eta_i = s_i$. To investigate the change in denominator $\|\eta_i\|$ at this step, we define $\gamma_i = \|\eta_i\|^2$,
where this quantity is defined before and after the update as follows:

\[ \gamma_i = \|\eta_i\|^2 = \|B^{-1}A_i\|^2 + 1, \tag{13.35a} \]

\[ \gamma_i^+ = \|\eta_i^+\|^2 = \|(B^+)^{-1}A_i\|^2 + 1. \tag{13.35b} \]


Assume without loss of generality that the entering column $A_q$ replaces the first column of
the basis matrix $B$ (that is, $p = 1$), and that this column corresponds to the index $t$. We can
then express the update to $B$ as follows:

\[ B^+ = B + (A_q - A_t)e_1^T = B + (A_q - Be_1)e_1^T, \tag{13.36} \]

where $e_1 = (1, 0, 0, \dots, 0)^T$. By applying the Sherman-Morrison formula (A.27) to the
rank-one update formula in (13.36), we obtain


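\[
(B^+)^{-1} = B^{-1} - \frac{(B^{-1}A_q - e_1)e_1^TB^{-1}}{1 + e_1^T(B^{-1}A_q - e_1)}
           = B^{-1} - \frac{(d - e_1)e_1^TB^{-1}}{e_1^Td},
\]

where again we have used the fact that $d = B^{-1}A_q$ (see (13.25)). Therefore, we have that

\[
(B^+)^{-1}A_i = B^{-1}A_i - \frac{e_1^TB^{-1}A_i}{e_1^Td}\,(d - e_1).
\]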


By substituting for $(B^+)^{-1}A_i$ in (13.35) and performing some simple manipulation, we
obtain

\[
\gamma_i^+ = \gamma_i
 - 2\left(\frac{e_1^TB^{-1}A_i}{e_1^Td}\right) A_i^TB^{-T}d
 + \left(\frac{e_1^TB^{-1}A_i}{e_1^Td}\right)^2 \gamma_q.
\tag{13.37}
\]


Once we solve the following two linear systems to obtain $\hat{d}$ and $r$:

\[ B^T\hat{d} = d, \qquad B^Tr = e_1, \tag{13.38} \]

the formula (13.37) becomes

\[
\gamma_i^+ = \gamma_i
 - 2\left(\frac{r^TA_i}{r^TA_q}\right) \hat{d}^TA_i
 + \left(\frac{r^TA_i}{r^TA_q}\right)^2 \gamma_q.
\tag{13.39}
\]

Hence, the entire set of $\gamma_i$ values, for $i \in \mathcal{N}$ with $i \ne q$, can be calculated by solving the two
systems (13.38) and then evaluating the inner products $r^TA_i$ and $\hat{d}^TA_i$, for each $i$.

The  steepest-edge  strategy  does  not  guarantee  that  we  can  take  a  long  step  before 
reaching  another  vertex,  but  it  has  proved  to  be  highly  effective  in  practice. 
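
The update (13.39) can be checked numerically. The sketch below is an illustration under stated assumptions, not the Goldfarb and Reid implementation: the function names, the dictionary `gamma` of current weights, and the use of dense `numpy.linalg.solve` calls in place of stored LU factors are all choices made here. The position `p` of the leaving column is general, so $e_1$ in (13.38) becomes $e_p$.

```python
import numpy as np

def updated_gammas(A, basis, q, p, gamma):
    """Update gamma_i = ||eta_i||^2 for nonbasic i != q when column q replaces
    the basic column in position p, using the two solves of (13.38)."""
    B = A[:, basis]
    d = np.linalg.solve(B, A[:, q])                      # B d = A_q
    d_hat = np.linalg.solve(B.T, d)                      # B^T d_hat = d
    r = np.linalg.solve(B.T, np.eye(len(basis))[p])      # B^T r = e_p
    new = {}
    for i, g in gamma.items():
        if i == q:
            continue
        ratio = (r @ A[:, i]) / (r @ A[:, q])
        new[i] = g - 2.0 * ratio * (d_hat @ A[:, i]) + ratio**2 * gamma[q]
    return new

def direct_gamma(A, basis, i):
    """gamma_i = ||B^{-1} A_i||^2 + 1, computed from scratch for comparison."""
    y = np.linalg.solve(A[:, basis], A[:, i])
    return float(y @ y + 1.0)

# Hypothetical data: identity basis plus three nonbasic columns.
A = np.hstack([np.eye(3), np.array([[2.0, 1.0, 1.0],
                                    [1.0, 3.0, 1.0],
                                    [1.0, 1.0, 4.0]])])
```

For this data, replacing the second basic column by column 4 gives updated weights that agree with a fresh computation from the new basis, at the cost of two solves and two inner products per nonbasic column.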


STARTING  THE  SIMPLEX  METHOD 

The simplex method requires a basic feasible starting point $x$ and a corresponding
initial basis $\mathcal{B} \subset \{1, 2, \dots, n\}$ with $|\mathcal{B}| = m$ such that the basis matrix $B$ defined by (13.13)
is nonsingular and $x_B = B^{-1}b \ge 0$ and $x_N = 0$. The problem of finding this initial point and
basis  may  itself  be  nontrivial — in  fact,  its  difficulty  is  equivalent  to  that  of  actually  solving 
a  linear  program.  We  describe  here  the  two-phase  approach  that  is  commonly  used  to  deal 
with  this  difficulty  in  practical  implementations. 

In  Phase  I  of  this  approach  we  set  up  an  auxiliary  linear  program  based  on  the  data 
of  (13.1),  and  solve  it  with  the  simplex  method.  The  Phase-I  problem  is  designed  so  that  an 
initial  basis  and  initial  basic  feasible  point  is  trivial  to  find,  and  so  that  its  solution  gives  a 
basic  feasible  initial  point  for  the  second  phase.  In  Phase  II,  a  second  linear  program  similar 
to  the  original  problem  (13.1)  is  solved,  with  the  Phase-I  solution  as  a  starting  point.  The 
solution  of  the  original  problem  (13.1)  can  be  extracted  easily  from  the  solution  of  the 
Phase-II  problem. 

In  Phase  I  we  introduce  artificial  variables  z  into  (13.1)  and  redefine  the  objective 
function  to  be  the  sum  of  these  artificial  variables,  as  follows: 

\[ \min\; e^Tz, \quad \text{subject to } Ax + Ez = b, \; (x, z) \ge 0, \tag{13.40} \]

where $z \in \mathbb{R}^m$, $e = (1, 1, \dots, 1)^T$, and $E$ is a diagonal matrix whose diagonal elements are

\[ E_{jj} = +1 \;\text{ if } b_j \ge 0, \qquad E_{jj} = -1 \;\text{ if } b_j < 0. \]

It is easy to see that the point $(x, z)$ defined by

\[ x = 0, \qquad z_j = |b_j|, \quad j = 1, 2, \dots, m, \tag{13.41} \]

is a basic feasible point for (13.40). Obviously, this point satisfies the constraints in (13.40),
while the initial basis matrix $B$ is simply the diagonal matrix $E$, which is clearly nonsingular.
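
The construction of (13.40) and its starting point (13.41) is mechanical, as the following NumPy sketch suggests. The function name `phase_one_problem` and its return layout are assumptions made for illustration; variables are ordered as $(x, z)$, so the artificial columns carry the indices $n, \dots, n+m-1$.

```python
import numpy as np

def phase_one_problem(A, b):
    """Build the data of (13.40): min e^T z  s.t.  A x + E z = b, (x, z) >= 0,
    together with the starting point (13.41) and its basis."""
    m, n = A.shape
    E = np.diag(np.where(b >= 0, 1.0, -1.0))          # E_jj = +1 or -1
    A1 = np.hstack([A, E])                            # constraint matrix [A  E]
    c1 = np.concatenate([np.zeros(n), np.ones(m)])    # objective e^T z
    x0 = np.concatenate([np.zeros(n), np.abs(b)])     # x = 0, z_j = |b_j|
    basis = list(range(n, n + m))                     # artificial columns
    return A1, c1, x0, basis
```

The Phase-I objective at the starting point equals the sum of the $|b_j|$, which is the total constraint violation of $x = 0$.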

At any feasible point for (13.40), the artificial variables $z$ represent the amounts by
which the constraints $Ax = b$ are violated by the $x$ component. The objective function is
simply the sum of these violations, so by minimizing this sum we are forcing $x$ to become
feasible for the original problem (13.1). It is not difficult to see that the Phase-I problem
(13.40) has an optimal objective value of zero if and only if the original problem (13.1) is
feasible, by using the following argument: If there exists a vector $(\tilde{x}, \tilde{z})$ that is feasible for
(13.40) such that $e^T\tilde{z} = 0$, we must have $\tilde{z} = 0$, and therefore $A\tilde{x} = b$ and $\tilde{x} \ge 0$, so $\tilde{x}$ is
feasible for (13.1). Conversely, if $\tilde{x}$ is feasible for (13.1), then the point $(\tilde{x}, 0)$ is feasible for
(13.40) with an objective value of 0. Since the objective in (13.40) is obviously nonnegative
at all feasible points, $(\tilde{x}, 0)$ must be optimal for (13.40), verifying our claim.

In  Phase  I,  we  apply  the  simplex  method  to  (13.40)  from  the  initial  point  (13.41).  This 
linear  program  cannot  be  unbounded,  because  its  objective  function  is  bounded  below  by  0, 
so  the  simplex  method  will  terminate  at  an  optimal  point  (assuming  that  it  does  not  cycle;  see 
below). If the objective $e^Tz$ is positive at this solution, we conclude by the argument above
that the original problem (13.1) is infeasible. Otherwise, the simplex method identifies a
point $(\tilde{x}, \tilde{z})$ with $e^T\tilde{z} = 0$, which is also a basic feasible point for the following Phase-II
problem:

\[ \min\; c^Tx \quad \text{subject to } Ax + z = b, \; x \ge 0, \; 0 \ge z \ge 0. \tag{13.42} \]

Note that this problem differs from (13.40) in that the objective function is replaced by the
original objective $c^Tx$, while upper bounds of 0 have been imposed on $z$. In fact, (13.42) is
equivalent to (13.1), because any solution (and indeed any feasible point) must have $z = 0$.
We need to retain the artificial variables $z$ in Phase II, however, since some components of $z$
may still be present in the optimal basis from Phase I that we are using as the initial basis for
(13.42), though of course the values $z_j$ of these components must be zero. In fact, we can
modify  (13.42)  to  include  only  those  components  of  z  that  are  present  in  the  optimal  basis 
for  (13.40). 

The  problem  (13.42)  is  not  quite  in  standard  form  because  of  the  two-sided  bounds 
on  z.  However,  it  is  easy  to  modify  the  simplex  method  described  above  to  handle  upper 
and  lower  bounds  on  the  variables  (we  omit  the  details).  We  can  customize  the  simplex 
algorithm  slightly  by  deleting  each  component  of  z  from  the  problem  (13.42)  as  soon  as  it 
is  swapped  out  of  the  basis.  This  strategy  ensures  that  components  of  z  do  not  repeatedly 
enter  and  leave  the  basis,  thereby  avoiding  unnecessary  simplex  iterations. 

If $(x^*, z^*)$ is a basic solution of (13.42), it must have $z^* = 0$, and so $x^*$ is a solution
of (13.1). In fact, $x^*$ is a basic feasible point for (13.1), though this claim is not completely
obvious because the final basis $\mathcal{B}$ for the Phase-II problem may still contain components of
$z^*$, making it unsuitable as an optimal basis for (13.1). Since $A$ has full row rank, however,
we can construct an optimal basis for (13.1) in a postprocessing phase: Extract from $\mathcal{B}$ any
components of $z$ that are present, and replace them with nonbasic components of $x$ in a
way that maintains nonsingularity of the submatrix $B$ defined by (13.13).

A  final  point  to  note  is  that  in  many  problems  we  do  not  need  to  add  a  complete  set  of 
m  artificial  variables  to  form  the  Phase-I  problem.  This  observation  is  particularly  relevant 
when slack and surplus variables have already been added to the problem formulation, as
in (13.2), to obtain a linear program with inequality constraints in standard form (13.1).
Some of these slack/surplus variables can play the roles of artificial variables, making it
unnecessary to include such variables explicitly.

We  illustrate  this  point  with  the  following  example. 


Example 13.2

Consider the inequality-constrained linear program defined by

\[
\begin{aligned}
\min\; & 3x_1 + x_2 + x_3 \quad \text{subject to} \\
& 2x_1 + x_2 + x_3 \le 2, \\
& x_1 - x_2 - x_3 \le -1, \\
& x \ge 0.
\end{aligned}
\]

By adding slack variables to both inequality constraints, we obtain the following equivalent
problem in standard form:

\[
\begin{aligned}
\min\; & 3x_1 + x_2 + x_3 \quad \text{subject to} \\
& 2x_1 + x_2 + x_3 + x_4 = 2, \\
& x_1 - x_2 - x_3 + x_5 = -1, \\
& x \ge 0.
\end{aligned}
\]

By inspection, it is easy to see that the vector $x = (0, 0, 0, 2, 0)$ is feasible with respect to
the first linear constraint and the lower bound $x \ge 0$, though it does not satisfy the second
constraint. Hence, in forming the Phase-I problem, we add just a single artificial variable $z_2$
to the second constraint and obtain

\[ \min\; z_2 \quad \text{subject to} \tag{13.43} \]

\[ 2x_1 + x_2 + x_3 + x_4 = 2, \tag{13.44} \]

\[ x_1 - x_2 - x_3 + x_5 - z_2 = -1, \tag{13.45} \]

\[ (x, z_2) \ge 0. \tag{13.46} \]

It is easy to see that the vector $(x, z_2) = ((0, 0, 0, 2, 0), 1)$ is feasible with respect to (13.43).
In fact, it is a basic feasible point, since the corresponding basis matrix $B$ is

\[
B = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix},
\]

which is clearly nonsingular. In this example, the variable $x_4$ plays the role of artificial variable
for the first constraint. There was no need to add an explicit artificial variable $z_1$.


DEGENERATE STEPS AND CYCLING

As  noted  above,  the  simplex  method  may  encounter  situations  in  which  for  the 
entering index $q$, we cannot set $x_q^+$ any greater than zero in (13.22) without violating the
nonnegativity condition $x^+ \ge 0$. By referring to Procedure 13.1, we see that these situations
arise when there is $i$ with $(x_B)_i = 0$ and $d_i > 0$, where $d$ is defined by (13.25). Steps of
this type are called degenerate steps. On such steps, the components of $x$ do not change
and, therefore, the objective function $c^Tx$ does not decrease. However, the steps may still
be useful because they change the basis $\mathcal{B}$ (by replacing one index), and the updated $\mathcal{B}$
may be closer to the optimal basis. In other words, the degenerate step may be laying the
groundwork for reductions in $c^Tx$ on later steps.

Sometimes,  however,  a  phenomenon  known  as  cycling  can  occur.  After  a  number  of 
successive  degenerate  steps,  we  may  return  to  an  earlier  basis  B.  If  we  continue  to  apply  the 
algorithm  from  this  point  using  the  same  rules  for  selecting  entering  and  leaving  indices, 
we  will  repeat  the  same  cycle  ad  infinitum,  never  converging. 

Cycling  was  once  thought  to  be  a  rare  phenomenon,  but  in  recent  times  it  has  been 
observed frequently in the large linear programs that arise as relaxations of integer
programming problems. Since integer programs are an important source of linear programs,
practical  simplex  codes  usually  incorporate  a  cycling  avoidance  strategy. 

In  the  remainder  of  this  section,  we  describe  a  perturbation  strategy  and  its  close 
relative,  the  lexicographic  strategy. 

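Suppose that a degenerate basis is encountered at some simplex iteration, at which
the basis is $\hat{\mathcal{B}}$ and the basis matrix is $\hat{B}$, say. We consider a modified linear program in
which we add a small perturbation to the right-hand side of the constraints in (13.1), as
follows:

\[
b(\epsilon) = b + \hat{B} \begin{bmatrix} \epsilon \\ \epsilon^2 \\ \vdots \\ \epsilon^m \end{bmatrix},
\]

where $\epsilon$ is a very small positive number. This perturbation in $b$ induces a perturbation in
the components of the basic solution vector; we have

\[
x_{\hat{B}}(\epsilon) = x_{\hat{B}} + \begin{bmatrix} \epsilon \\ \epsilon^2 \\ \vdots \\ \epsilon^m \end{bmatrix}.
\tag{13.47}
\]

Retaining the perturbation for subsequent iterations, we see that subsequent basic
solutions have the form

\[
x_B(\epsilon) = x_B + B^{-1}\hat{B} \begin{bmatrix} \epsilon \\ \epsilon^2 \\ \vdots \\ \epsilon^m \end{bmatrix}
             = x_B + \sum_{k=1}^m (B^{-1}\hat{B})_{\cdot k}\, \epsilon^k,
\tag{13.48}
\]

where $(B^{-1}\hat{B})_{\cdot k}$ denotes the $k$th column of $B^{-1}\hat{B}$ and $x_B$ represents the basic solution for
the unperturbed right-hand side $b$.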

From (13.47), we have that for all $\epsilon$ sufficiently small (but positive), $(x_{\hat{B}}(\epsilon))_i > 0$ for
all $i$. Hence, the basis is nondegenerate for the perturbed problem, and we can perform a
step of the simplex method that produces a nonzero (but tiny) decrease in the objective.

Indeed, if we retain the perturbation over all subsequent iterations, and provided
that the initial choice of $\epsilon$ was small enough, we claim that all subsequent bases visited by
the algorithm are nondegenerate. We prove this claim by contradiction, by assuming that
there is some basis matrix $B$ such that $(x_B(\epsilon))_i = 0$ for some $i$ and all $\epsilon$ sufficiently small.
From (13.48), we see that this can happen only when $(x_B)_i = 0$ and $(B^{-1}\hat{B})_{ik} = 0$ for
$k = 1, 2, \dots, m$. The latter relation implies that the $i$th row of $B^{-1}\hat{B}$ is zero, which cannot
occur, because both $B$ and $\hat{B}$ are nonsingular.

We  conclude  that,  provided  the  initial  choice  of  e  is  sufficiently  small  to  ensure 
nondegeneracy  of  all  subsequent  bases,  no  basis  is  visited  more  than  once  by  the  simplex 
method  and  therefore,  by  the  same  logic  as  in  the  proof  of  Theorem  13.4,  the  method 
terminates  finitely  at  a  solution  of  the  perturbed  problem.  The  perturbation  can  be  removed 
in a postprocessing phase, by resetting $x_B = B^{-1}b$ for the final basis $\mathcal{B}$ and the original
right-hand  side  b. 
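
A small numerical experiment makes the effect of the perturbation concrete. The sketch below is illustrative only: the helper name `perturbed_rhs`, the $2 \times 2$ basis matrix, and the value $\epsilon = 10^{-3}$ are assumptions chosen so that the unperturbed basic solution has a zero component.

```python
import numpy as np

def perturbed_rhs(B_hat, b, eps):
    """Return b(eps) = b + B_hat @ (eps, eps^2, ..., eps^m)^T."""
    m = B_hat.shape[0]
    return b + B_hat @ (eps ** np.arange(1, m + 1))

# Hypothetical degenerate basis: B_hat^{-1} b = (1, 0).
B_hat = np.array([[1.0, 1.0],
                  [1.0, 2.0]])
b = np.array([1.0, 1.0])

x_B = np.linalg.solve(B_hat, b)                                # has a zero component
x_B_eps = np.linalg.solve(B_hat, perturbed_rhs(B_hat, b, 1e-3))  # strictly positive
```

In agreement with (13.47), each basic component is shifted by the corresponding power of $\epsilon$, so the zero component becomes $\epsilon^2 > 0$.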

The question remains of how to choose $\epsilon$ small enough at the point at which the
original degenerate basis $\hat{\mathcal{B}}$ is encountered. The lexicographic strategy finesses this issue by
not making an explicit choice of $\epsilon$, but rather keeping track of the dependence of each basic
variable on each power of $\epsilon$. When it comes to selecting the leaving variable, it chooses the
index $p$ that minimizes $(x_B(\epsilon))_i / d_i$ over all variables in the basis, for all sufficiently small
$\epsilon$. (The choice of $p$ is uniquely defined by this procedure, as we can show by an argument
similar to the one above concerning nondegeneracy of each basis.) We can extend the pivot
procedure slightly to update the dependence of each basic variable on the powers of $\epsilon$ at
each iteration, including the variable $x_q$ that has just entered the basis.
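
One way to realize this rule, sketched below under several assumptions, is to store for each basic variable the coefficients of $\epsilon^0, \epsilon^1, \dots, \epsilon^m$ from (13.48) and to compare the scaled coefficient rows lexicographically. The function name, the matrix argument `M` standing for $B^{-1}\hat{B}$, the tolerance `tol`, and the restriction of the comparison to rows with $d_i > 0$ (as in the ratio test of Procedure 13.1) are choices made for this sketch. Exact tuple comparison of floating-point numbers is also a simplification.

```python
import numpy as np

def lexicographic_leaving_index(x_B, M, d, tol=1e-12):
    """Pick the leaving position p minimizing (x_B(eps))_i / d_i for all
    sufficiently small eps > 0. Row i of [x_B | M] holds the coefficients of
    eps^0, eps^1, ..., eps^m in (x_B(eps))_i, with M = B^{-1} B_hat."""
    coeffs = np.column_stack([x_B, M])
    candidates = [i for i in range(len(d)) if d[i] > tol]
    if not candidates:
        return None                      # no blocking variable along this edge
    # Python tuples compare lexicographically, which matches the ordering of
    # the polynomials in eps when eps is small and positive.
    return min(candidates, key=lambda i: tuple(coeffs[i] / d[i]))
```

For example, with $x_B = (0, 0, 1)$, $M = I$, and $d = (1, 2, 1)$, the ordinary ratio test ties between the first two positions, while the perturbed ratios are $\epsilon$ and $\epsilon^2/2$, so the second position is chosen.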


13.6 THE DUAL SIMPLEX METHOD


Here  we  describe  another  variant  of  the  simplex  method  that  is  useful  in  a  variety  of  situations 
and is often faster on many practical problems than the variant described above. This dual
simplex method uses many of the same concepts and methodology described above, such as
the splitting of the matrix $A$ into column submatrices $B$ and $N$ and the generation of iterates
$(x, \lambda, s)$ that satisfy the complementarity condition $x^Ts = 0$. The method of Section 13.3
starts with a feasible $x$ (with $x_B \ge 0$ and $x_N = 0$) and a corresponding dual iterate $(\lambda, s)$
for which $s_B = 0$ but $s_N$ is not necessarily nonnegative. After making systematic column
interchanges between $B$ and $N$, it finally reaches a feasible dual point $(\lambda, s)$ at which $s_N \ge 0$,
thus yielding a solution of both the primal problem (13.1) and the dual (13.8). By contrast,
the dual simplex method starts with a point $(\lambda, s)$ feasible for (13.8), at which $s_N \ge 0$
and $s_B = 0$, and a corresponding primal point $x$ for which $x_N = 0$ but $x_B$ is not
necessarily nonnegative. By making systematic column interchanges between $B$ and $N$, it
finally reaches a feasible primal point $x$ for which $x_B \ge 0$, signifying optimality. Note that
although the matrix $B$ used in this algorithm is a nonsingular column submatrix of $A$, it
is no longer correct to refer to it as a basis matrix, since it does not satisfy the feasibility
condition $x_B = B^{-1}b \ge 0$.

We now describe a single step of this method in a similar fashion to Section 13.3,
though the details are a little more complicated here. As mentioned above, we commence
each step with submatrices $B$ and $N$ of $A$, and corresponding sets $\mathcal{B}$ and $\mathcal{N}$. The primal and
dual variables corresponding to these sets are defined as follows (cf. (13.18), (13.20), and
(13.21)):

\[ x_B = B^{-1}b, \qquad x_N = 0, \tag{13.49a} \]

\[ \lambda = B^{-T}c_B, \tag{13.49b} \]

\[ s_B = c_B - B^T\lambda = 0, \qquad s_N = c_N - N^T\lambda \ge 0. \tag{13.49c} \]

If $x_B \ge 0$, the current point $(x, \lambda, s)$ satisfies the optimality conditions (13.4), and we are
done. Otherwise, we select a leaving index $q \in \mathcal{B}$ such that $x_q < 0$. Our aim is to move $x_q$ to
zero (thereby ensuring that nonnegativity holds for this component), while allowing $s_q$ to
increase away from zero. We will also identify an entering index $r \in \mathcal{N}$, such that $s_r$ becomes
zero on this step while $x_r$ increases away from zero. Hence, the index $q$ will move from $\mathcal{B}$ to
$\mathcal{N}$, while $r$ will move from $\mathcal{N}$ to $\mathcal{B}$. How do we choose $r$, and how are $x$, $\lambda$, and $s$ changed
on this step? The description below provides the answer. We use $(x^+, \lambda^+, s^+)$ to denote the
updated values of our variables, after this step is taken.

First, let $e_q$ be the vector of length $m$ that contains all zeros except for a 1 in the position
occupied by index $q$ in the set $\mathcal{B}$. Since we increase $s_q$ away from zero while fixing the
remaining components of $s_B$ at zero, the updated value $s_B^+$ will have the form

\[ s_B^+ = s_B + \alpha e_q \tag{13.50} \]

for some positive scalar $\alpha$ to be determined. We write the corresponding update to $\lambda$ as

\[ \lambda^+ = \lambda + \alpha v, \tag{13.51} \]

for some vector $v$. In fact, since $s_B^+$ and $\lambda^+$ must satisfy the first equation in (13.49c), we
must have

\[
\begin{aligned}
s_B^+ &= c_B - B^T\lambda^+ \\
\Rightarrow \quad s_B + \alpha e_q &= c_B - B^T(\lambda + \alpha v) \\
\Rightarrow \quad e_q &= -B^Tv,
\end{aligned}
\tag{13.52}
\]

which is a system of equations that we can solve to obtain $v$.

To see how the dual objective value $b^T\lambda$ changes as a result of this step, we use (13.52)
and the fact that $x_q = x_B^Te_q$ to obtain

\[
\begin{aligned}
b^T\lambda^+ &= b^T\lambda + \alpha b^Tv \\
&= b^T\lambda - \alpha b^TB^{-T}e_q && \text{from (13.52)} \\
&= b^T\lambda - \alpha x_B^Te_q && \text{from (13.49a)} \\
&= b^T\lambda - \alpha x_q && \text{by definition of } e_q.
\end{aligned}
\]

Since $x_q < 0$ and since our aim is to maximize the dual objective, we would like to choose $\alpha$
as large as possible. The upper bound on $\alpha$ is provided by the constraint $s_N^+ \ge 0$. Similarly
to (13.49c), we have

\[ s_N^+ = c_N - N^T\lambda^+ = s_N - \alpha N^Tv = s_N - \alpha w, \]

where we have defined

\[ w = N^Tv = -N^TB^{-T}e_q. \]

The largest $\alpha$ for which $s_N^+ \ge 0$ is given by the formula

\[ \alpha = \min_{j \in \mathcal{N},\; w_j > 0} \frac{s_j}{w_j}. \]

We define the entering index $r$ to be the index at which the minimum in this expression is
achieved. Note that

\[ s_r^+ = 0 \quad \text{and} \quad w_r = A_r^Tv > 0, \tag{13.53} \]

where, as usual, $A_r$ denotes the $r$th column of $A$.

Having now identified how $\lambda$ and $s$ are updated on this step, we need to figure out
how $x$ changes. For the leaving index $q$, we need to set $x_q^+ = 0$, while for the entering index
$r$ we can allow $x_r^+$ to be nonzero. We denote the direction of change for $x_B$ to be the vector
$d$, defined by the following linear system:

\[ Bd = \sum_{i \in \mathcal{B}} A_id_i = A_r. \tag{13.54} \]

Since from (13.49a), we have

\[ \sum_{i \in \mathcal{B}} A_ix_i = b, \]

we have that

\[ \sum_{i \in \mathcal{B}} A_i(x_i - \gamma d_i) + A_r\gamma = b, \tag{13.55} \]

for any scalar $\gamma$. To ensure that $x_q^+ = 0$, we set

\[ \gamma = \frac{x_q}{d_q}, \tag{13.56} \]

which is well defined only if $d_q$ is nonzero. In fact, we have that $d_q < 0$, since

\[ d_q = d^Te_q = A_r^TB^{-T}e_q = -A_r^Tv = -w_r < 0, \]

where we have used the definition of $e_q$ along with (13.54), (13.52), and (13.53) to derive
these relationships. Since $x_q < 0$, it follows from (13.56) that $\gamma > 0$. Following (13.55) we
can define the updated vector $x^+$ as follows:

\[
x_i^+ =
\begin{cases}
x_i - \gamma d_i, & \text{for } i \in \mathcal{B} \text{ with } i \ne q, \\
0, & \text{for } i = q, \\
0, & \text{for } i \in \mathcal{N} \text{ with } i \ne r, \\
\gamma, & \text{for } i = r.
\end{cases}
\]

13.7  PRESOLVING 


Presolving  (also  known  as  preprocessing)  is  carried  out  in  practical  linear  programming 
codes  to  reduce  the  size  of  the  user-defined  linear  programming  problem  before  passing  it 
to  the  solver.  A  variety  of  techniques — some  obvious,  some  ingenious — are  used  to  eliminate 
certain  variables,  constraints,  and  bounds  from  the  problem.  Often  the  reduction  in  problem 
size  is  quite  dramatic,  and  the  linear  programming  algorithm  takes  much  less  time  when 
applied to the presolved problem than when applied to the original problem. Presolving is
beneficial regardless of what algorithm is used to solve the linear program; it is used both
in simplex and interior-point codes. Infeasibility may also be detected by the presolver,
eliminating the need to call the linear programming algorithm at all.

We  mention  just  a  few  of  the  more  straightforward  preprocessing  techniques  here, 
referring  the  interested  reader  to  Andersen  and  Andersen  [4]  for  a  more  comprehensive  list. 
For  the  purpose  of  this  discussion,  we  assume  that  the  linear  program  is  formulated  with 
both  lower  and  upper  bounds  on  x,  that  is, 

\[ \min\; c^Tx, \quad \text{subject to } Ax = b, \; l \le x \le u, \tag{13.57} \]

where some components $l_i$ of the lower bound vector may be $-\infty$ and some upper bounds
$u_i$ may be $+\infty$.

Consider first a row singleton, which happens when one of the equality constraints
involves just one of the variables. Specifically, if constraint $k$ involves only variable $j$ (that
is, $A_{kj} \ne 0$, but $A_{ki} = 0$ for all $i \ne j$), we can immediately set $x_j = b_k / A_{kj}$ and eliminate
$x_j$ from the problem. Note that if this value of $x_j$ violates its bounds (that is, $x_j < l_j$ or
$x_j > u_j$), we can declare the problem to be infeasible, and terminate.
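
A row-singleton check might look as follows in Python. This is a minimal sketch: the function name `find_row_singletons`, the tolerance `tol`, and the convention of returning `None` on infeasibility are assumptions, and the subsequent removal of the fixed variables from the remaining constraints is left out.

```python
import numpy as np

def find_row_singletons(A, b, l, u, tol=1e-12):
    """Find rows of A with exactly one nonzero and fix the variable involved.
    Returns a dict {j: value}, or None if a fixed value violates its bounds."""
    fixed = {}
    for k in range(A.shape[0]):
        nz = np.flatnonzero(np.abs(A[k]) > tol)
        if len(nz) == 1:
            j = int(nz[0])
            value = b[k] / A[k, j]                 # x_j = b_k / A_kj
            if value < l[j] - tol or value > u[j] + tol:
                return None                        # problem is infeasible
            fixed[j] = value
    return fixed
```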

Another obvious technique is the free column singleton, in which there is a variable $x_j$
that occurs in only one of the equality constraints, and is free (that is, its lower bound is $-\infty$
and its upper bound is $+\infty$). In this case, we have for some $k$ that $A_{kj} \ne 0$ while $A_{lj} = 0$
for all $l \ne k$. Here we can simply use constraint $k$ to eliminate $x_j$ from the problem, setting

\[ x_j = \frac{b_k - \sum_{p \ne j} A_{kp}x_p}{A_{kj}}. \]

Once the values of $x_p$ for $p \ne j$ have been obtained by solving a reduced linear program,
we can substitute into this formula to recover $x_j$ prior to returning the result to the user.
This substitution does not require us to modify any other constraints, but it will change the
cost vector $c$ in general, whenever $c_j \ne 0$. We will need to make the replacement

\[ c_p \leftarrow c_p - c_jA_{kp}/A_{kj}, \quad \text{for all } p \ne j. \]

In this case, we can also determine the dual variable associated with constraint $k$. Since $x_j$ is
a free variable, there is no dual slack associated with it, so the $j$th dual constraint becomes

\[ \sum_{l=1}^m A_{lj}\lambda_l = c_j \;\Rightarrow\; A_{kj}\lambda_k = c_j, \]

from which we deduce that $\lambda_k = c_j / A_{kj}$.
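
The bookkeeping for a free column singleton can be sketched as below. Several details are assumptions of this sketch and not statements from the text: the function name and return layout, the decision to drop row $k$ from the reduced problem (on the view that the constraint is used up in defining the free variable $x_j$), and the constant `offset` $= c_jb_k/A_{kj}$ that the substitution adds to the objective.

```python
import numpy as np

def remove_free_column_singleton(A, b, c, j, k):
    """Eliminate the free variable x_j, which appears only in row k.
    Returns (A_red, b_red, c_red, offset, lam_k, recover), where `recover`
    maps the remaining variables to the value of x_j."""
    a_kj = A[k, j]
    keep_cols = [p for p in range(A.shape[1]) if p != j]
    keep_rows = [i for i in range(A.shape[0]) if i != k]
    row = A[k, keep_cols]
    c_red = c[keep_cols] - c[j] * row / a_kj       # c_p <- c_p - c_j A_kp / A_kj
    offset = c[j] * b[k] / a_kj                    # constant term in the objective
    lam_k = c[j] / a_kj                            # dual variable of constraint k
    A_red = A[np.ix_(keep_rows, keep_cols)]
    b_red = b[keep_rows]
    recover = lambda x_rest: (b[k] - row @ x_rest) / a_kj
    return A_red, b_red, c_red, offset, lam_k, recover
```

For any values of the remaining variables, the reduced objective plus `offset` equals the original objective evaluated with the recovered $x_j$.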

Perhaps the simplest preprocessing check is for the presence of zero rows and columns
in $A$. If $A_{ki} = 0$ for all $i = 1, 2, \dots, n$, then provided that the right-hand side is also
zero ($b_k = 0$), we can simply delete this row from the problem and set the corresponding
Lagrange multiplier $\lambda_k$ to an arbitrary value. For a zero column, say $A_{kj} = 0$ for all
$k = 1, 2, \dots, m$, we can determine the optimal value of $x_j$ by inspecting its cost coefficient
$c_j$ and its bounds $l_j$ and $u_j$. If $c_j < 0$, we set $x_j = u_j$ to minimize the product $c_j x_j$. (We
are free to do so because $x_j$ is not restricted by any of the equality constraints.) If $c_j < 0$
and $u_j = +\infty$, then the problem is unbounded. Similarly, if $c_j > 0$, we set $x_j = l_j$, or else
declare unboundedness if $l_j = -\infty$.
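The zero-column rule translates directly into a small decision function. The sketch below is illustrative only: the function name and the return convention are hypothetical, and the branch for $c_j = 0$ (which the text does not discuss) simply picks some finite value within the bounds.

```python
import math

def resolve_zero_column(c_j, l_j, u_j):
    """Decide x_j for a variable whose column of A is entirely zero."""
    if c_j < 0:
        # Minimizing c_j * x_j pushes x_j to its upper bound.
        return ("unbounded", None) if u_j == math.inf else ("fixed", u_j)
    if c_j > 0:
        # Minimizing c_j * x_j pushes x_j to its lower bound.
        return ("unbounded", None) if l_j == -math.inf else ("fixed", l_j)
    # c_j == 0 is not covered in the text; any value within the bounds is
    # optimal, so this sketch picks a finite one (an assumption).
    if l_j != -math.inf:
        return ("fixed", l_j)
    if u_j != math.inf:
        return ("fixed", u_j)
    return ("fixed", 0.0)
```

For instance, `resolve_zero_column(-1.0, 0.0, 5.0)` fixes the variable at its upper bound, while an infinite upper bound with the same cost signals unboundedness.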

A  somewhat  more  subtle  presolving  technique  is  to  check  for  forcing  or  dominated 
constraints.  Rather  than  give  a  general  specification,  we  illustrate  this  case  with  a  simple 
example.  Suppose  that  one  of  the  equality  constraints  is  as  follows: 


$$5x_1 - x_4 + 2x_5 = 10,$$


where the variables in question have the following bounds:

$$0 \le x_1 \le 1, \quad -1 \le x_4 \le 5, \quad 0 \le x_5 \le 2.$$

It is not hard to see that the equality constraint can only be satisfied if $x_1$ and $x_5$ are at their
upper bounds and $x_4$ is at its lower bound. Any other feasible values of these variables would
result in the left-hand side of the equality constraint being strictly less than 10. Hence, we
can set $x_1 = 1$, $x_4 = -1$, $x_5 = 2$ and eliminate these variables, and the equality constraint,
from the problem.
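One way to detect a forcing constraint mechanically is to compare the right-hand side with the smallest and largest values the left-hand side can take over the bounds. The following sketch assumes nonzero coefficients and finite bounds; the function name, the status strings, and the tolerance `tol` are illustrative choices.

```python
def forcing_assignment(coeffs, lower, upper, rhs, tol=1e-12):
    """Check one equality row sum_i a_i x_i = rhs against variable bounds.

    Returns ("forcing", values) when the bounds pin every variable,
    ("infeasible", None) when rhs is out of reach, and ("none", None) otherwise.
    """
    # Bound choices that minimize / maximize each term a_i * x_i.
    at_min = [l if a > 0 else u for a, l, u in zip(coeffs, lower, upper)]
    at_max = [u if a > 0 else l for a, l, u in zip(coeffs, lower, upper)]
    lo = sum(a * x for a, x in zip(coeffs, at_min))
    hi = sum(a * x for a, x in zip(coeffs, at_max))
    if rhs > hi + tol or rhs < lo - tol:
        return "infeasible", None
    if abs(rhs - hi) <= tol:
        return "forcing", at_max
    if abs(rhs - lo) <= tol:
        return "forcing", at_min
    return "none", None


# The example from the text: 5*x1 - x4 + 2*x5 = 10.
status, values = forcing_assignment([5, -1, 2], [0, -1, 0], [1, 5, 2], 10)
```

Here the maximum attainable left-hand side equals the right-hand side, so `values` holds the forced assignment for $(x_1, x_4, x_5)$.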

We  use  a  similar  example  to  illustrate  dominated  constraints.  Suppose  that  we  have 
the  following  constraint  involving  three  variables: 


$$2x_2 + x_6 - 3x_7 = 8,$$


where the variables in question have the following bounds:

$$-10 \le x_2 \le 10, \quad 0 \le x_6 \le 1, \quad 0 \le x_7 \le 2.$$

By rearranging the constraint and using the bounds on $x_6$ and $x_7$, we find that

$$x_2 = 4 - (1/2)x_6 + (3/2)x_7 \le 4 - 0 + (3/2)2 = 7,$$

and similarly, using the opposite bounds on $x_6$ and $x_7$, we obtain $x_2 \ge 7/2$. We conclude
that the stated bounds of $-10$ and $10$ on $x_2$ are redundant, since $x_2$ is implicitly confined to
an even smaller interval by the combination of the equality constraint and the bounds on
$x_6$ and $x_7$. Hence, we can drop the bounds on $x_2$ from the formulation and treat it as a free
variable.
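The implied-bound computation in this example can be written generically. The sketch below assumes finite bounds on the other variables and a nonzero coefficient on the variable of interest; `implied_bounds` is a name introduced here for illustration.

```python
def implied_bounds(coeffs, lower, upper, rhs, j):
    """Bounds on x_j implied by sum_i a_i x_i = rhs and the bounds on the others."""
    others = [(a, l, u)
              for i, (a, l, u) in enumerate(zip(coeffs, lower, upper)) if i != j]
    # Smallest and largest values of the terms not involving x_j.
    rest_lo = sum(min(a * l, a * u) for a, l, u in others)
    rest_hi = sum(max(a * l, a * u) for a, l, u in others)
    ends = ((rhs - rest_lo) / coeffs[j], (rhs - rest_hi) / coeffs[j])
    return min(ends), max(ends)


# The example from the text: 2*x2 + x6 - 3*x7 = 8, looking at x2 (index 0).
lo, hi = implied_bounds([2, 1, -3], [-10, 0, 0], [10, 1, 2], 8, j=0)
bounds_redundant = (-10 <= lo) and (hi <= 10)
```

The explicit bounds on $x_2$ are dominated exactly when the implied interval lies inside them, which is what `bounds_redundant` records.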

Presolving techniques are applied recursively, because the elimination of certain variables
or constraints may create situations that allow further eliminations. As a trivial example,
suppose that the following two equality constraints are present in the problem:

$$3x_2 = 6, \quad x_2 + 4x_5 = 10.$$

The first of these constraints is a row singleton, which we can use to set $x_2 = 2$ and
eliminate this variable and constraint. After substitution, the second constraint becomes
$4x_5 = 10 - x_2 = 8$, which is again a row singleton. We can therefore set $x_5 = 2$ and
eliminate this variable and constraint as well.
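The recursive character of presolving can be illustrated with row singletons alone. The sketch below repeats passes over the equality rows until no new singleton appears; the sparse row representation (a dictionary of coefficients plus a right-hand side) and the function name are assumptions made for this example, and bound checks on the fixed values are omitted.

```python
def eliminate_row_singletons(rows):
    """rows: list of (coeffs, rhs) with coeffs a dict {variable: coefficient}.

    Returns (fixed, remaining): values determined by row singletons and the
    rows that could not be eliminated. Coefficients are assumed nonzero.
    """
    fixed = {}
    rows = [(dict(coeffs), rhs) for coeffs, rhs in rows]
    changed = True
    while changed:
        changed = False
        remaining = []
        for coeffs, rhs in rows:
            # Substitute variables that earlier eliminations have fixed.
            for v in list(coeffs):
                if v in fixed:
                    rhs -= coeffs.pop(v) * fixed[v]
            if len(coeffs) == 1:
                v, a = next(iter(coeffs.items()))
                fixed[v] = rhs / a
                changed = True
            else:
                remaining.append((coeffs, rhs))
        rows = remaining
    return fixed, rows


# The example from the text, listed in the less convenient order.
fixed, remaining = eliminate_row_singletons([
    ({"x2": 1, "x5": 4}, 10),
    ({"x2": 3}, 6),
])
```

On the first pass only $3x_2 = 6$ is a singleton; the second pass finds that the other row has become one after substitution.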

Relatively little information about presolving techniques has appeared in the literature,
in part because they have commercial value as an important component of linear
programming software.


13.8 WHERE DOES THE SIMPLEX METHOD FIT?


In linear programming, as in all optimization problems in which inequality constraints are
present, the fundamental task of the algorithm is to determine which of these constraints
are active at the solution (see Definition 12.1) and which are inactive. The simplex method
belongs to a general class of algorithms for constrained optimization known as active set
methods, which explicitly maintain estimates of the active and inactive index sets that are
updated at each step of the algorithm. (At each iteration, the basis $\mathcal{B}$ is our current estimate
of the inactive set, that is, the set of indices $i$ for which we suspect that $x_i > 0$ at the
solution of the linear program.) Like most active set methods, the simplex method makes
only modest changes to these index sets at each step: a single index is exchanged between $\mathcal{B}$
and $\mathcal{N}$.

Active  set  algorithms  for  quadratic  programming,  bound-constrained  optimization, 
and  nonlinear  programming  use  the  same  basic  strategy  as  simplex  of  making  an  explicit 
estimate  of  the  active  set  and  taking  a  step  toward  the  solution  of  a  reduced  problem  in  which 
the  constraints  in  this  estimated  active  set  are  satisfied  as  equalities.  When  nonlinearity  enters 
the  problem,  many  of  the  features  that  make  the  simplex  method  so  effective  no  longer  apply. 
For example, it is no longer true in general that at least $n - m$ of the bounds $x \ge 0$ are active
at  the  solution,  and  the  specialized  linear  algebra  techniques  described  in  Section  13.5  no 
longer  apply.  Nevertheless,  the  simplex  method  is  rightly  viewed  as  the  antecedent  of  the 
active  set  class  of  methods  for  constrained  optimization. 

One undesirable feature of the simplex method attracted attention from its earliest
days. Though highly efficient on almost all practical problems (the method generally requires
at most $2m$ to $3m$ iterations, where $m$ is the row dimension of the constraint matrix in
(13.1)), there are pathological problems on which the algorithm performs very poorly.
Klee and Minty [182] presented an $n$-dimensional problem whose feasible polytope has $2^n$
vertices, for which the simplex method visits every single vertex before reaching the optimal
point! This example verified that the complexity of the simplex method is exponential;
roughly speaking, its running time may be an exponential function of the dimension of
the problem. For many years, theoreticians searched for a linear programming algorithm
that has polynomial complexity, that is, an algorithm in which the running time is bounded
by a polynomial function of the amount of storage required to define the problem. In
the late 1970s, Khachiyan [180] described an ellipsoid method that indeed has polynomial
complexity but turned out to be impractical. In the mid-1980s, Karmarkar [175] described
a polynomial algorithm that approaches the solution through the interior of the feasible
polytope rather than working its way around the boundary as the simplex method does.
Karmarkar's announcement marked the start of intense research in the field of interior-point
methods, which are the subject of the next chapter.


NOTES  AND  REFERENCES 

The  standard  reference  for  the  simplex  method  is  Dantzig’s  book  [86].  Later  excellent 
texts  include  Chvatal  [61]  and  Vanderbei  [293]. 

Further  information  on  steepest-edge  pivoting  can  be  found  in  Goldfarb  and 
Reid  [134]  and  Goldfarb  and  Forrest  [133]. 

An alternative procedure for performing the Phase-I calculation of an initial basis
was described by Wolfe [310]. This technique does not require artificial variables to be
introduced in the problem formulation, but rather starts at any point $x$ that satisfies $Ax = b$
with at most $m$ nonzero components in $x$. (Note that we do not require the basic part $x_B$ to
consist of all positive components.) Phase I then consists in solving the problem

$$\min_x \sum_{x_i < 0} -x_i, \quad \text{subject to } Ax = b,$$

and terminating when an objective value of 0 is attained. This problem is not a linear
program (its objective is only piecewise linear), but it can be solved by the simplex method
nonetheless. The key is to redefine the cost vector $f$ at each iterate $x$ such that $f_i = -1$
for $x_i < 0$ and $f_i = 0$ otherwise.
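The redefinition of the cost vector is easy to state in code. The sketch below, with hypothetical helper names, builds $f$ for a given $x$ and confirms on one point that the linear function $f^T x$ agrees with the piecewise-linear Phase-I objective there.

```python
def phase1_cost(x):
    """Cost vector f with f_i = -1 where x_i < 0 and f_i = 0 otherwise."""
    return [-1.0 if xi < 0 else 0.0 for xi in x]

def infeasibility(x):
    """Piecewise-linear Phase-I objective: sum of -x_i over negative components."""
    return sum(-xi for xi in x if xi < 0)


x = [1.0, -2.0, 0.0, -0.5]
f = phase1_cost(x)
linear_value = sum(fi * xi for fi, xi in zip(f, x))
```

The agreement holds only as long as no component changes sign, which is why $f$ has to be rebuilt at every iterate.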


Exercises

13.1 Convert the following linear program to standard form:

$$\max_{x,y}\; c^T x + d^T y \quad \text{subject to } A_1 x = b_1,\; A_2 x + B_2 y \le b_2,\; l \le y \le u,$$

where there are no explicit bounds on $x$.

13.2 Verify that the dual of (13.8) is the original primal problem (13.1).

13.3 Complete the proof of Theorem 13.1 by showing that if the dual (13.7) is
unbounded above, the primal (13.1) must be infeasible.

13.4 Theorem 13.1 does not exclude the possibility that both primal and dual are
infeasible. Give a simple linear program for which such is the case.

13.5 Show that the dual of the linear program

$$\min\; c^T x \quad \text{subject to } Ax \ge b,\; x \ge 0,$$

is

$$\max\; b^T \lambda \quad \text{subject to } A^T \lambda \le c,\; \lambda \ge 0.$$

13.6 Show that when $m \le n$ and the rows of $A$ are linearly dependent in (13.1), then
the matrix $B$ in (13.13) is singular, and therefore there are no basic feasible points.

13.7 Consider the overdetermined linear system $Ax = b$ with $m$ rows and $n$ columns
($m > n$). When we apply Gaussian elimination with complete pivoting to $A$, we obtain

$$PAQ = L \begin{bmatrix} U_{11} & U_{12} \\ 0 & 0 \end{bmatrix},$$

where $P$ and $Q$ are permutation matrices, $L$ is $m \times m$ lower triangular, $U_{11}$ is $\bar{m} \times \bar{m}$ upper
triangular and nonsingular, $U_{12}$ is $\bar{m} \times (n - \bar{m})$, and $\bar{m} \le n$ is the rank of $A$.

(a) Show that the system $Ax = b$ is feasible if the last $m - \bar{m}$ components of $L^{-1}Pb$ are
zero, and infeasible otherwise.

(b) When $\bar{m} = n$, find the unique solution of $Ax = b$.

(c) Show that the reduced system formed from the first $\bar{m}$ rows of $PA$ and the first $\bar{m}$
components of $Pb$ is equivalent to $Ax = b$ (i.e., a solution of one system also solves
the other).

13.8  Verify  formula  (13.37). 

13.9 Consider the following linear program:

$$\min\; -5x_1 - x_2 \quad \text{subject to}$$
$$x_1 + x_2 \le 5,$$
$$2x_1 + (1/2)x_2 \le 8,$$
$$x \ge 0.$$

(a) Add slack variables $x_3$ and $x_4$ to convert this problem to standard form.

(b) Using Procedure 13.1, solve this problem using the simplex method, showing at each
step the basis and the vectors $\lambda$, $s_N$, and $x_B$, and the value of the objective function. (The
initial choice of $\mathcal{B}$ for which $x_B \ge 0$ should be obvious once you have added the slacks
in part (a).)

13.10 Calculate the values of $l_{52}$, $l_{53}$, $l_{54}$, and $w_2$ in (13.30), by equating the last row
of $L_1 U_1$ to the last row of the matrix in (13.29).

13.11 By extending the procedure (13.27) appropriately, show how the factorization
(13.31) can be used to solve linear systems with coefficient matrix $B^+$ efficiently.

In the 1980s it was discovered that many large linear programs could be solved efficiently by
using  formulations  and  algorithms  from  nonlinear  programming  and  nonlinear  equations. 
One  characteristic  of  these  methods  was  that  they  required  all  iterates  to  satisfy  the  inequality 
constraints  in  the  problem  strictly,  so  they  became  known  as  interior-point  methods.  By 
the  early  1990s,  a  subclass  of  interior-point  methods  known  as  primal-dual  methods  had 
distinguished  themselves  as  the  most  efficient  practical  approaches,  and  proved  to  be  strong 
competitors  to  the  simplex  method  on  large  problems.  These  methods  are  the  focus  of  this 
chapter. 

Interior-point methods arose from the search for algorithms with better theoretical
properties than the simplex method. As we mentioned in Chapter 13, the simplex method can
be  inefficient  on  certain  pathological  problems.  Roughly  speaking,  the  time  required  to  solve 
a  linear  program  may  be  exponential  in  the  size  of  the  problem,  as  measured  by  the  number 
of  unknowns  and  the  amount  of  storage  needed  for  the  problem  data.  For  almost  all  practical 
problems,  the  simplex  method  is  much  more  efficient  than  this  bound  would  suggest,  but 
its  poor  worst-case  complexity  motivated  the  development  of  new  algorithms  with  better 
guaranteed  performance.  The  first  such  method  was  the  ellipsoid  method,  proposed  by 
Khachiyan  [180],  which  finds  a  solution  in  time  that  is  at  worst  polynomial  in  the  problem 
size.  Unfortunately,  this  method  approaches  its  worst-case  bound  on  all  problems  and  is 
not  competitive  with  the  simplex  method  in  practice. 

Karmarkar’s  projective  algorithm  [175],  announced  in  1984,  also  has  the  polynomial 
complexity  property,  but  it  came  with  the  added  attraction  of  good  practical  behavior.  The 
initial  claims  of  excellent  performance  on  large  linear  programs  were  never  fully  borne 
out,  but  the  announcement  prompted  a  great  deal  of  research  activity  which  gave  rise  to 
many  new  methods.  All  are  related  to  Karmarkar’s  original  algorithm,  and  to  the  log-barrier 
approach  described  in  Chapter  19,  but  many  of  the  approaches  can  be  motivated  and 
analyzed  independently  of  the  earlier  methods. 

Interior-point  methods  share  common  features  that  distinguish  them  from  the  simplex 
method.  Each  interior-point  iteration  is  expensive  to  compute  and  can  make  significant 
progress  towards  the  solution,  while  the  simplex  method  usually  requires  a  larger  number  of 
inexpensive  iterations.  Geometrically  speaking,  the  simplex  method  works  its  way  around 
the  boundary  of  the  feasible  polytope,  testing  a  sequence  of  vertices  in  turn  until  it  finds  the 
optimal  one.  Interior-point  methods  approach  the  boundary  of  the  feasible  set  only  in  the 
limit.  They  may  approach  the  solution  either  from  the  interior  or  the  exterior  of  the  feasible 
region,  but  they  never  actually  lie  on  the  boundary  of  this  region. 

In  this  chapter,  we  outline  some  of  the  basic  ideas  behind  primal-dual  interior-point 
methods,  including  the  relationship  to  Newton’s  method  and  homotopy  methods  and  the 
concept of the central path. We sketch the important methods in this class, and give a comprehensive
convergence analysis of a particular interior-point method known as a long-step
path-following method. We describe in some detail a practical predictor-corrector algorithm
proposed  by  Mehrotra,  which  is  the  basis  of  much  of  the  current  generation  of  software. 


14.1  PRIMAL-DUAL  METHODS 

OUTLINE 

We consider the linear programming problem in standard form; that is,

$$\min\; c^T x, \quad \text{subject to } Ax = b,\; x \ge 0, \qquad (14.1)$$

where $c$ and $x$ are vectors in $\mathbb{R}^n$, $b$ is a vector in $\mathbb{R}^m$, and $A$ is an $m \times n$ matrix with full row
rank. (As in Chapter 13, we can preprocess the problem to remove dependent rows from $A$
if necessary.) The dual problem for (14.1) is

$$\max\; b^T \lambda, \quad \text{subject to } A^T \lambda + s = c,\; s \ge 0, \qquad (14.2)$$

where $\lambda$ is a vector in $\mathbb{R}^m$ and $s$ is a vector in $\mathbb{R}^n$. As shown in Chapter 13, solutions of
(14.1), (14.2) are characterized by the Karush-Kuhn-Tucker conditions (13.4), which we
restate here as follows:


$$A^T \lambda + s = c, \qquad (14.3a)$$
$$Ax = b, \qquad (14.3b)$$
$$x_i s_i = 0, \quad i = 1, 2, \dots, n, \qquad (14.3c)$$
$$(x, s) \ge 0. \qquad (14.3d)$$


Primal-dual methods find solutions $(x^*, \lambda^*, s^*)$ of this system by applying variants of
Newton's method to the three equalities in (14.3) and modifying the search directions and
step lengths so that the inequalities $(x, s) \ge 0$ are satisfied strictly at every iteration. The
equations  (14.3a),  (14.3b),  (14.3c)  are  linear  or  only  mildly  nonlinear  and  so  are  not  difficult 
to  solve  by  themselves.  However,  the  problem  becomes  much  more  difficult  when  we  add  the 
nonnegativity  requirement  (14.3d),  which  gives  rise  to  all  the  complications  in  the  design 
and  analysis  of  interior-point  methods. 

To derive primal-dual interior-point methods we restate the optimality conditions
(14.3) in a slightly different form by means of a mapping $F$ from $\mathbb{R}^{2n+m}$ to $\mathbb{R}^{2n+m}$:

$$F(x, \lambda, s) = \begin{bmatrix} A^T \lambda + s - c \\ Ax - b \\ XSe \end{bmatrix} = 0, \qquad (14.4a)$$
$$(x, s) \ge 0, \qquad (14.4b)$$

where

$$X = \mathrm{diag}(x_1, x_2, \dots, x_n), \quad S = \mathrm{diag}(s_1, s_2, \dots, s_n), \qquad (14.5)$$

and $e = (1, 1, \dots, 1)^T$. Primal-dual methods generate iterates $(x^k, \lambda^k, s^k)$ that satisfy the
bounds (14.4b) strictly, that is, $x^k > 0$ and $s^k > 0$. This property is the origin of the term
interior-point. By respecting these bounds, the methods avoid spurious solutions, that is,
points that satisfy $F(x, \lambda, s) = 0$ but not $(x, s) \ge 0$. Spurious solutions abound, and do not
provide useful information about solutions of (14.1) or (14.2), so it makes sense to exclude
them altogether from the region of search.

Like most iterative algorithms in optimization, primal-dual interior-point methods
have two basic ingredients: a procedure for determining the step and a measure of the
desirability of each point in the search space. An important component of the measure of
desirability is the average value of the pairwise products $x_i s_i$, $i = 1, 2, \dots, n$, which are all
positive when $x > 0$ and $s > 0$. This quantity is known as the duality measure and is defined
as follows:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i s_i = \frac{x^T s}{n}. \qquad (14.6)$$
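To make the mapping $F$ and the duality measure concrete, here is a small NumPy sketch on a toy problem chosen for this illustration (minimize $x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \ge 0$); the function names are hypothetical. It also exhibits one spurious solution: a root of $F$ with a negative component in $s$.

```python
import numpy as np

def F(A, b, c, x, lam, s):
    """Stack the three blocks of (14.4a): A^T lam + s - c, A x - b, X S e."""
    return np.concatenate([A.T @ lam + s - c, A @ x - b, x * s])

def duality_measure(x, s):
    """mu = x^T s / n, as in (14.6)."""
    return float(x @ s) / x.size


# Toy data (an assumption for illustration): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])

# A genuine primal-dual solution: F = 0 and (x, s) >= 0.
solution = (np.array([1.0, 0.0]), np.array([1.0]), np.array([0.0, 1.0]))
# A spurious root: F = 0 as well, but s has a negative component.
spurious = (np.array([0.0, 1.0]), np.array([2.0]), np.array([-1.0, 0.0]))

mu = duality_measure(np.array([0.5, 0.5]), np.array([1.0, 2.0]))
```

Both triples make every block of $F$ vanish, so only the sign condition (14.4b) tells them apart.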


The procedure for determining the search direction has its origins in Newton's method
for the nonlinear equations (14.4a). Newton's method forms a linear model for $F$ around
the current point and obtains the search direction $(\Delta x, \Delta\lambda, \Delta s)$ by solving the following
system of linear equations:

$$J(x, \lambda, s) \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = -F(x, \lambda, s),$$

where $J$ is the Jacobian of $F$. (See Chapter 11 for a detailed discussion of Newton's method
for nonlinear systems.) If we use the notation $r_c$ and $r_b$ for the first two block rows in $F$,
that is,

$$r_b = Ax - b, \quad r_c = A^T \lambda + s - c, \qquad (14.7)$$

we can write the Newton equations as follows:

$$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} -r_c \\ -r_b \\ -XSe \end{bmatrix}. \qquad (14.8)$$

Usually, a full step along this direction would violate the bound $(x, s) \ge 0$, so we perform a
line search along the Newton direction and define the new iterate as

$$(x, \lambda, s) + \alpha (\Delta x, \Delta\lambda, \Delta s),$$

for some line search parameter $\alpha \in (0, 1]$. We often can take only a small step along this
direction ($\alpha \ll 1$) before violating the condition $(x, s) > 0$. Hence, the pure Newton
direction (14.8), sometimes known as the affine scaling direction, often does not allow us to
make much progress toward a solution.

Most primal-dual methods use a less aggressive Newton direction, one that does not
aim directly for a solution of (14.3a), (14.3b), (14.3c) but rather for a point whose pairwise
products $x_i s_i$ are reduced to a lower average value, not all the way to zero. Specifically, we
take a Newton step toward a point for which $x_i s_i = \sigma\mu$, where $\mu$ is the current duality
measure and $\sigma \in [0, 1]$ is the reduction factor that we wish to achieve in the duality measure
on this step. The modified step equation is then

$$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} -r_c \\ -r_b \\ -XSe + \sigma\mu e \end{bmatrix}. \qquad (14.9)$$

We call $\sigma$ the centering parameter, for reasons to be discussed below. When $\sigma > 0$, it usually
is possible to take a longer step $\alpha$ along the direction defined by (14.16) before violating the
bounds $(x, s) \ge 0$.
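The effect of the centering parameter on the admissible step length can be checked numerically. The sketch below assembles the block system (14.9) densely with NumPy and measures how far one can move before a component of $x$ or $s$ reaches zero. The toy problem, the starting point, and the helper names are assumptions made for this illustration; a practical code would exploit the block structure rather than form the full matrix.

```python
import numpy as np

def step_direction(A, b, c, x, lam, s, sigma):
    """Solve (14.9); sigma = 0 gives the pure Newton (affine scaling) step."""
    m, n = A.shape
    mu = x @ s / n
    J = np.block([
        [np.zeros((n, n)), A.T,              np.eye(n)],
        [A,                np.zeros((m, m)), np.zeros((m, n))],
        [np.diag(s),       np.zeros((n, m)), np.diag(x)],
    ])
    rhs = np.concatenate([
        -(A.T @ lam + s - c),   # -r_c
        -(A @ x - b),           # -r_b
        -x * s + sigma * mu,    # -XSe + sigma*mu*e
    ])
    d = np.linalg.solve(J, rhs)
    return d[:n], d[n:n + m], d[n + m:]

def step_to_boundary(v, dv):
    """Largest alpha with v + alpha*dv >= 0 (infinite if nothing decreases)."""
    dec = dv < 0
    return float(np.min(-v[dec] / dv[dec])) if dec.any() else np.inf


# Toy data (an assumption): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
x0, lam0, s0 = np.array([0.5, 0.5]), np.array([0.0]), np.array([1.0, 2.0])

def max_alpha(sigma):
    dx, _, ds = step_direction(A, b, c, x0, lam0, s0, sigma)
    return min(step_to_boundary(x0, dx), step_to_boundary(s0, ds))

alpha_newton = max_alpha(0.0)     # distance to the boundary, pure Newton step
alpha_centered = max_alpha(0.5)   # distance to the boundary with centering
```

On this example the pure Newton direction hits the boundary at $\alpha = 3/4$, so a full step is not possible, whereas with $\sigma = 0.5$ the boundary lies beyond $\alpha = 1$.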

At  this  point,  we  have  specified  most  of  the  elements  of  a  path-following  primal-dual 
interior-point  method.  The  general  framework  for  such  methods  is  as  follows. 

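Framework 14.1 (Primal-Dual Path-Following).

Given $(x^0, \lambda^0, s^0)$ with $(x^0, s^0) > 0$;
for $k = 0, 1, 2, \dots$

    Choose $\sigma_k \in [0, 1]$ and solve

    $$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S^k & 0 & X^k \end{bmatrix} \begin{bmatrix} \Delta x^k \\ \Delta\lambda^k \\ \Delta s^k \end{bmatrix} = \begin{bmatrix} -r_c^k \\ -r_b^k \\ -X^k S^k e + \sigma_k \mu_k e \end{bmatrix}, \qquad (14.10)$$

    where $\mu_k = (x^k)^T s^k / n$;
    Set

    $$(x^{k+1}, \lambda^{k+1}, s^{k+1}) = (x^k, \lambda^k, s^k) + \alpha_k (\Delta x^k, \Delta\lambda^k, \Delta s^k), \qquad (14.11)$$

    choosing $\alpha_k$ so that $(x^{k+1}, s^{k+1}) > 0$.

end (for).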

The choices of centering parameter $\sigma_k$ and step length $\alpha_k$ are crucial to the performance
of the method. Techniques for controlling these parameters, directly and indirectly, give rise
to a wide variety of methods with diverse properties.
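A single pass through the loop body of Framework 14.1 might look as follows. The framework leaves $\sigma_k$ and $\alpha_k$ open, so the rule used here (a fixed fraction `frac` of the distance to the boundary, capped at 1) is only one hypothetical choice, and the infeasible starting point is picked to show that the residuals $r_b$ and $r_c$ shrink by the factor $1 - \alpha_k$ because both are linear in the iterate.

```python
import numpy as np

def framework_iteration(A, b, c, x, lam, s, sigma, frac=0.9):
    """One iteration of Framework 14.1 with an assumed step-length rule."""
    m, n = A.shape
    mu = x @ s / n
    r_c = A.T @ lam + s - c
    r_b = A @ x - b
    J = np.block([
        [np.zeros((n, n)), A.T,              np.eye(n)],
        [A,                np.zeros((m, m)), np.zeros((m, n))],
        [np.diag(s),       np.zeros((n, m)), np.diag(x)],
    ])
    d = np.linalg.solve(J, np.concatenate([-r_c, -r_b, -x * s + sigma * mu]))
    dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]

    # Step length: stay strictly inside (x, s) > 0 (assumed rule, see lead-in).
    v, dv = np.concatenate([x, s]), np.concatenate([dx, ds])
    dec = dv < 0
    alpha = min(1.0, frac * float(np.min(-v[dec] / dv[dec]))) if dec.any() else 1.0
    return x + alpha * dx, lam + alpha * dlam, s + alpha * ds, alpha


# Toy data (an assumption): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0,
# started from a positive but infeasible point.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
x0, lam0, s0 = np.array([1.0, 1.0]), np.array([0.0]), np.array([1.0, 1.0])
x1, lam1, s1, alpha = framework_iteration(A, b, c, x0, lam0, s0, sigma=0.5)
```

Repeating the call with the returned iterate gives the full loop; how $\sigma_k$ and $\alpha_k$ are chosen from one iteration to the next is what distinguishes the methods discussed in the rest of the chapter.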

Although software for implementing interior-point methods does not usually start
from a point $(x^0, \lambda^0, s^0)$ that is feasible with respect to the linear equations (14.3a) and
(14.3b), most of the historical development of theory and algorithms assumed that these
conditions are satisfied. In the remainder of this section, we discuss this feasible case,
showing that a comprehensive convergence analysis can be presented in just a few pages,
using only basic mathematical tools and concepts. Analysis of the infeasible case follows the
same principles, but is considerably more complicated in the details, so we do not present
it here. In Section 14.2, however, we describe a complete practical algorithm that does not
require starting from a feasible initial point.

To  begin  our  discussion  and  analysis  of  feasible  interior-point  methods,  we  introduce 
the  concept  of  the  central  path,  and  then  describe  neighborhoods  of  this  path. 


THE  CENTRAL  PATH 

The primal-dual feasible set $\mathcal{F}$ and strictly feasible set $\mathcal{F}^o$ are defined as follows:

$$\mathcal{F} = \{(x, \lambda, s) \mid Ax = b,\; A^T \lambda + s = c,\; (x, s) \ge 0\}, \qquad (14.12a)$$
$$\mathcal{F}^o = \{(x, \lambda, s) \mid Ax = b,\; A^T \lambda + s = c,\; (x, s) > 0\}. \qquad (14.12b)$$

The central path $\mathcal{C}$ is an arc of strictly feasible points that plays a vital role in primal-dual
algorithms. It is parametrized by a scalar $\tau > 0$, and each point $(x_\tau, \lambda_\tau, s_\tau) \in \mathcal{C}$ satisfies the
following equations:

$$A^T \lambda + s = c, \qquad (14.13a)$$
$$Ax = b, \qquad (14.13b)$$
$$x_i s_i = \tau, \quad i = 1, 2, \dots, n, \qquad (14.13c)$$
$$(x, s) > 0. \qquad (14.13d)$$

These conditions differ from the KKT conditions only in the term $\tau$ on the right-hand side
of (14.13c). Instead of the complementarity condition (14.3c), we require that the pairwise
products $x_i s_i$ have the same (positive) value for all indices $i$. From (14.13), we can define
the central path as

$$\mathcal{C} = \{(x_\tau, \lambda_\tau, s_\tau) \mid \tau > 0\}.$$

It can be shown that $(x_\tau, \lambda_\tau, s_\tau)$ is defined uniquely for each $\tau > 0$ if and only if $\mathcal{F}^o$ is
nonempty.

The conditions (14.13) are also the optimality conditions for a logarithmic-barrier
formulation of the problem (14.1). By introducing log-barrier terms for the nonnegativity
constraints, with barrier parameter $\tau > 0$, we obtain

$$\min\; c^T x - \tau \sum_{i=1}^{n} \ln x_i, \quad \text{subject to } Ax = b. \qquad (14.14)$$

The KKT conditions (12.34) for this problem, with Lagrange multiplier $\lambda$ for the equality
constraint, are as follows:

$$c_i - \frac{\tau}{x_i} = A_{\cdot i}^T \lambda, \quad i = 1, 2, \dots, n, \qquad Ax = b.$$

Since the objective in (14.14) is strictly convex, these conditions are sufficient as well as
necessary for optimality. We recover (14.13) by defining $s_i = \tau / x_i$, $i = 1, 2, \dots, n$.
Another way of defining $\mathcal{C}$ is to use the mapping $F$ defined in (14.4) and write

$$F(x_\tau, \lambda_\tau, s_\tau) = \begin{bmatrix} 0 \\ 0 \\ \tau e \end{bmatrix}, \quad (x_\tau, s_\tau) > 0. \qquad (14.15)$$

The equations (14.13) approximate (14.3) more and more closely as $\tau$ goes to zero. If
$\mathcal{C}$ converges to anything as $\tau \downarrow 0$, it must converge to a primal-dual solution of the linear
program. The central path thus guides us to a solution along a route that maintains positivity
of the $x$ and $s$ components and decreases the pairwise products $x_i s_i$, $i = 1, 2, \dots, n$, to zero
at the same rate.
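For very small problems the central path can be written down explicitly. Take the toy problem of minimizing $x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \ge 0$ (an example chosen for this illustration, not from the text). The dual equations give $s_2 - s_1 = 1$, so with $s_i = \tau / x_i$ and $x_1 = 1 - x_2$ the component $x_2$ solves $x_2^2 - (1 + 2\tau) x_2 + \tau = 0$, and the root in $(0, 1/2)$ is the one that keeps $(x, s) > 0$. The sketch below evaluates this closed form.

```python
import numpy as np

def central_path_point(tau):
    """(x_tau, lam_tau, s_tau) for min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0."""
    # Smaller root of x2^2 - (1 + 2*tau)*x2 + tau = 0.
    x2 = ((1 + 2 * tau) - np.sqrt(1 + 4 * tau**2)) / 2
    x = np.array([1 - x2, x2])
    s = tau / x                    # enforces x_i * s_i = tau
    lam = np.array([1 - s[0]])     # from lam + s_1 = c_1 = 1
    return x, lam, s


A = np.array([[1.0, 1.0]])
c = np.array([1.0, 2.0])
points = {tau: central_path_point(tau) for tau in (1.0, 0.1, 1e-3)}
x_small, _, _ = central_path_point(1e-6)
```

As $\tau$ decreases, the computed $x_\tau$ moves toward the vertex $(1, 0)$, which is the solution of this toy problem, while for large $\tau$ it approaches the midpoint $(1/2, 1/2)$ of the feasible segment.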

Most primal-dual algorithms take Newton steps toward points on $\mathcal{C}$ for which $\tau > 0$,
rather than pure Newton steps for $F$. Since these steps are biased toward the interior of
the nonnegative orthant defined by $(x, s) \ge 0$, it usually is possible to take longer steps
along them than along the pure Newton (affine scaling) steps, before violating the positivity
condition.

In the feasible case of $(x, \lambda, s) \in \mathcal{F}$, we have $r_b = 0$ and $r_c = 0$, so the search direction
satisfies a special case of (14.8), that is,

$$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ -XSe + \sigma\mu e \end{bmatrix}, \qquad (14.16)$$

where $\mu$ is the duality measure defined by (14.6) and $\sigma \in [0, 1]$ is the centering parameter.
When $\sigma = 1$, the equations (14.16) define a centering direction, a Newton step
toward the point $(x_\mu, \lambda_\mu, s_\mu) \in \mathcal{C}$, at which all the pairwise products $x_i s_i$ are identical
to the current average value of $\mu$. Centering directions are usually biased strongly toward
the interior of the nonnegative orthant and make little, if any, progress in reducing the
duality measure $\mu$. However, by moving closer to $\mathcal{C}$, they set the scene for a substantial
reduction in $\mu$ on the next iteration. At the other extreme, the value $\sigma = 0$ gives the
standard Newton (affine scaling) step. Many algorithms use intermediate values of $\sigma$ from
the open interval $(0, 1)$ to trade off between the twin goals of reducing $\mu$ and improving
centrality.

CENTRAL  PATH  NEIGHBORHOODS  AND  PATH-FOLLOWING  METHODS 

Path-following algorithms explicitly restrict the iterates to a neighborhood of the
central path $\mathcal{C}$ and follow $\mathcal{C}$ to a solution of the linear program. By preventing the iterates
from coming too close to the boundary of the nonnegative orthant, they ensure that it is
possible to take a nontrivial step along each search direction. Moreover, by forcing the
duality measure $\mu_k$ to zero as $k \to \infty$, we ensure that the iterates $(x^k, \lambda^k, s^k)$ come closer
and closer to satisfying the KKT conditions (14.3).

The two most interesting neighborhoods of $\mathcal{C}$ are

$$\mathcal{N}_2(\theta) = \{(x, \lambda, s) \in \mathcal{F}^o \mid \|XSe - \mu e\|_2 \le \theta\mu\}, \qquad (14.17)$$

for some $\theta \in [0, 1)$, and

$$\mathcal{N}_{-\infty}(\gamma) = \{(x, \lambda, s) \in \mathcal{F}^o \mid x_i s_i \ge \gamma\mu \;\text{ for all } i = 1, 2, \dots, n\}, \qquad (14.18)$$

for some $\gamma \in (0, 1]$. (Typical values of the parameters are $\theta = 0.5$ and $\gamma = 10^{-3}$.) If
a point lies in $\mathcal{N}_{-\infty}(\gamma)$, each pairwise product $x_i s_i$ must be at least some small multiple
$\gamma$ of their average value $\mu$. This requirement is actually quite modest, and we can make
$\mathcal{N}_{-\infty}(\gamma)$ encompass most of the feasible region $\mathcal{F}$ by choosing $\gamma$ close to zero. The $\mathcal{N}_2(\theta)$
neighborhood is more restrictive, since certain points in $\mathcal{F}^o$ do not belong to $\mathcal{N}_2(\theta)$ no
matter how close $\theta$ is chosen to its upper bound of 1.
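The two membership tests differ only in how they measure the spread of the pairwise products. The sketch below checks the product conditions of (14.17) and (14.18) for given positive vectors $x$ and $s$; it assumes that the linear feasibility conditions defining the strictly feasible set have been verified separately, and the function names and sample vectors are made up for this illustration.

```python
import numpy as np

def in_N2(x, s, theta):
    """Product condition of (14.17): ||XSe - mu*e||_2 <= theta*mu, with (x, s) > 0."""
    mu = x @ s / x.size
    return bool(np.all(x > 0) and np.all(s > 0)
                and np.linalg.norm(x * s - mu) <= theta * mu)

def in_N_minus_inf(x, s, gamma):
    """Product condition of (14.18): x_i*s_i >= gamma*mu for all i, with (x, s) > 0."""
    mu = x @ s / x.size
    return bool(np.all(x > 0) and np.all(s > 0) and np.all(x * s >= gamma * mu))


balanced = (np.array([0.5, 0.5]), np.array([1.0, 2.0]))     # products 0.5 and 1.0
lopsided = (np.array([1.0, 1.0]), np.array([0.01, 1.99]))   # products 0.01 and 1.99
```

The lopsided pair has $\|XSe - \mu e\|_2 \approx 1.4\mu$, so it fails the 2-norm test for every $\theta < 1$, yet it passes the one-sided test with $\gamma = 10^{-3}$.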

By keeping all iterates inside one or other of these neighborhoods, path-following
methods reduce all the pairwise products $x_i s_i$ to zero at more or less the same rate. Figure 14.1
shows the projection of the central path $\mathcal{C}$ onto the primal variables for a typical problem,
along with a typical neighborhood $\mathcal{N}$.

Path-following methods are akin to homotopy methods for general nonlinear equations,
which also define a path to be followed to the solution. Traditional homotopy methods
stay in a tight tubular neighborhood of their path, making incremental changes to the
parameter and chasing the homotopy path all the way to a solution. For primal-dual methods,
this neighborhood is horn-shaped rather than tubular, and it tends to be broad and loose
for larger values of the duality measure $\mu$. It narrows as $\mu \to 0$, however, because of the
positivity requirement $(x, s) > 0$.

The algorithm we specify below, a special case of Framework 14.1, is known as a
long-step path-following algorithm. This algorithm can make rapid progress because of its
use of the wide neighborhood $\mathcal{N}_{-\infty}(\gamma)$, for $\gamma$ close to zero. It depends on two parameters
$\sigma_{\min}$ and $\sigma_{\max}$, which are lower and upper bounds on the centering parameter $\sigma_k$. The search
direction is, as usual, obtained by solving (14.10), and we choose the step length $\alpha_k$ to be as
large as possible, subject to the requirement that we stay inside $\mathcal{N}_{-\infty}(\gamma)$.

Figure 14.1 Central path, projected into space of primal variables $x$, showing a
typical neighborhood $\mathcal{N}$.


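Here and in later analysis, we use the notation

$$(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) = (x^k, \lambda^k, s^k) + \alpha (\Delta x^k, \Delta\lambda^k, \Delta s^k), \qquad (14.19a)$$
$$\mu_k(\alpha) = x^k(\alpha)^T s^k(\alpha) / n. \qquad (14.19b)$$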

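Algorithm 14.2 (Long-Step Path-Following).

Given $\gamma$, $\sigma_{\min}$, $\sigma_{\max}$ with $\gamma \in (0, 1)$, $0 < \sigma_{\min} \le \sigma_{\max} < 1$,
and $(x^0, \lambda^0, s^0) \in \mathcal{N}_{-\infty}(\gamma)$;
for $k = 0, 1, 2, \dots$

    Choose $\sigma_k \in [\sigma_{\min}, \sigma_{\max}]$;

    Solve (14.10) to obtain $(\Delta x^k, \Delta\lambda^k, \Delta s^k)$;

    Choose $\alpha_k$ as the largest value of $\alpha$ in $[0, 1]$ such that

    $$(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \in \mathcal{N}_{-\infty}(\gamma); \qquad (14.20)$$

    Set $(x^{k+1}, \lambda^{k+1}, s^{k+1}) = (x^k(\alpha_k), \lambda^k(\alpha_k), s^k(\alpha_k))$;

end (for).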
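A compact NumPy sketch of Algorithm 14.2 follows. It departs from the statement above in two labeled ways: the centering parameter is held at a fixed value `sigma` (so $\sigma_{\min} = \sigma_{\max}$), and the "largest $\alpha$ in $[0, 1]$" is approximated by halving $\alpha$ from 1 until the trial point satisfies the neighborhood test. The toy problem, the strictly feasible starting point, and the stopping rule $\mu_k \le \epsilon \mu_0$ with `eps` are assumptions for illustration.

```python
import numpy as np

def in_neighborhood(x, s, gamma):
    """x > 0, s > 0 and x_i*s_i >= gamma*mu; linear feasibility is assumed."""
    mu = x @ s / x.size
    return bool(np.all(x > 0) and np.all(s > 0) and np.all(x * s >= gamma * mu))

def long_step_path_following(A, b, c, x, lam, s, gamma=1e-3, sigma=0.5,
                             eps=1e-8, max_iter=500):
    m, n = A.shape
    mu0 = x @ s / n
    for _ in range(max_iter):
        mu = x @ s / n
        if mu <= eps * mu0:
            break
        # Step equations (14.10) with the fixed centering parameter sigma.
        J = np.block([
            [np.zeros((n, n)), A.T,              np.eye(n)],
            [A,                np.zeros((m, m)), np.zeros((m, n))],
            [np.diag(s),       np.zeros((n, m)), np.diag(x)],
        ])
        rhs = np.concatenate([-(A.T @ lam + s - c), -(A @ x - b),
                              -x * s + sigma * mu])
        d = np.linalg.solve(J, rhs)
        dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
        # Backtracking stand-in for the largest alpha satisfying (14.20).
        alpha = 1.0
        while alpha > 1e-12 and not in_neighborhood(x + alpha * dx,
                                                    s + alpha * ds, gamma):
            alpha *= 0.5
        x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
    return x, lam, s


# Toy data (an assumption): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
x, lam, s = long_step_path_following(
    A, b, c, np.array([0.5, 0.5]), np.array([0.0]), np.array([1.0, 2.0]))
```

Because halving only approximates the largest admissible step, this version may take somewhat shorter steps than the algorithm as stated; an exact line search would instead solve the scalar quadratic inequalities in $\alpha$ that define membership in the neighborhood.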

Typical behavior of the algorithm is illustrated in Figure 14.2 for the case of $n = 2$.
The horizontal and vertical axes in this figure represent the pairwise products $x_1 s_1$ and $x_2 s_2$,
so the central path $\mathcal{C}$ is the line emanating from the origin at an angle of 45°. (A point at the
origin of this illustration is a primal-dual solution if it also satisfies the feasibility conditions
(14.3a), (14.3b), and (14.3d).) In the unusual geometry of Figure 14.2, the search directions
$(\Delta x^k, \Delta\lambda^k, \Delta s^k)$ transform to curves rather than straight lines.

As Figure 14.2 shows (and the analysis confirms), the lower bound $\sigma_{\min}$ on the
centering parameter ensures that each search direction starts out by moving away from the
boundary of $\mathcal{N}_{-\infty}(\gamma)$ and into the relative interior of this neighborhood. That is, small
steps along the search direction improve the centrality. Larger values of $\alpha$ take us outside
the neighborhood again, since the error in approximating the nonlinear system (14.15) by
the linear step equations (14.16) becomes more pronounced as $\alpha$ increases. Still, we are
guaranteed that a certain minimum step can be taken before we reach the boundary of
$\mathcal{N}_{-\infty}(\gamma)$, as we show in the analysis below.

The  analysis  of  Algorithm  14.2  appears  in  the  next  few  pages.  With  judicious  choices 
of $\sigma_k$, this algorithm is fairly efficient in practice. With a few more modifications, it becomes
the  basis  of  a  truly  competitive  method,  as  we  discuss  in  Section  14.2. 

Our aim in the analysis below is to show that given some small tolerance $\epsilon > 0$, the
algorithm requires $O(n |\log \epsilon|)$ iterations to reduce the duality measure by a factor of $\epsilon$, that
is, to identify a point $(x^k, \lambda^k, s^k)$ for which $\mu_k \le \epsilon\mu_0$. For small $\epsilon$, the point $(x^k, \lambda^k, s^k)$
satisfies the primal-dual optimality conditions except for perturbations of about $\epsilon$ in the
right-hand side of (14.3c), so it is usually very close to a primal-dual solution of the
original linear program. The $O(n |\log \epsilon|)$ estimate is a worst-case bound on the number
of iterations required; on practical problems, the number of iterations required appears
to increase only slightly (if at all) as $n$ increases. The simplex method may require $2^n$
iterations to solve a problem with $n$ variables, though in practice it usually requires a
modest multiple of $m$ iterations, where $m$ is the row dimension of the constraint matrix
$A$ in (14.1).

As is typical for interior-point methods, the analysis builds from a purely technical
lemma to a powerful theorem in just a few pages. We start with the technical result
(Lemma 14.1) and use it to derive a bound on the vector of pairwise products $\Delta x_i \Delta s_i$,
$i = 1, 2, \ldots, n$ (Lemma 14.2). Theorem 14.3 finds a lower bound on the step length and
a corresponding estimate of the reduction in $\mu$ on iteration $k$. Finally, Theorem 14.4 proves
that $O(n |\log \epsilon|)$ iterations are required to identify a point for which $\mu_k \le \epsilon\mu_0$, for a given
tolerance $\epsilon \in (0, 1)$.

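Lemma 14.1.

Let $u$ and $v$ be any two vectors in $\mathbb{R}^n$ with $u^T v \ge 0$. Then

\[ \|UVe\|_2 \le 2^{-3/2} \|u + v\|_2^2, \]

where

\[ U = \mathrm{diag}(u_1, u_2, \ldots, u_n), \quad V = \mathrm{diag}(v_1, v_2, \ldots, v_n). \]

PROOF. (When the subscript is omitted from $\|\cdot\|$, we mean $\|\cdot\|_2$, as is our convention
throughout the book.) First, note that for any two scalars $\alpha$ and $\beta$ with $\alpha\beta \ge 0$, we have
from the algebraic-geometric mean inequality that

\[ \sqrt{|\alpha\beta|} \le \tfrac{1}{2} |\alpha + \beta|. \tag{14.21} \]

Since $u^T v \ge 0$, we have

\[ 0 \le u^T v = \sum_{u_i v_i \ge 0} u_i v_i + \sum_{u_i v_i < 0} u_i v_i = \sum_{i \in \mathcal{P}} |u_i v_i| - \sum_{i \in \mathcal{M}} |u_i v_i|, \tag{14.22} \]

where we partitioned the index set $\{1, 2, \ldots, n\}$ as

\[ \mathcal{P} = \{ i \mid u_i v_i \ge 0 \}, \quad \mathcal{M} = \{ i \mid u_i v_i < 0 \}. \]

Now,

\begin{align*}
\|UVe\| &= \left( \|[u_i v_i]_{i \in \mathcal{P}}\|^2 + \|[u_i v_i]_{i \in \mathcal{M}}\|^2 \right)^{1/2} \\
&\le \left( \|[u_i v_i]_{i \in \mathcal{P}}\|_1^2 + \|[u_i v_i]_{i \in \mathcal{M}}\|_1^2 \right)^{1/2} && \text{since } \|\cdot\|_2 \le \|\cdot\|_1 \\
&\le \left( 2 \|[u_i v_i]_{i \in \mathcal{P}}\|_1^2 \right)^{1/2} && \text{from (14.22)} \\
&\le \sqrt{2} \left\| \left[ \tfrac{1}{4} (u_i + v_i)^2 \right]_{i \in \mathcal{P}} \right\|_1 && \text{from (14.21)} \\
&= 2^{-3/2} \sum_{i \in \mathcal{P}} (u_i + v_i)^2 \\
&\le 2^{-3/2} \sum_{i=1}^n (u_i + v_i)^2 \\
&\le 2^{-3/2} \|u + v\|^2,
\end{align*}

completing the proof.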
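As a sanity check rather than a proof, the bound of Lemma 14.1 can be spot-checked on random vectors. The sketch below assumes NumPy; the helper name `lemma_14_1_gap` and the sampling scheme are illustrative choices, not part of the analysis.

```python
import numpy as np

def lemma_14_1_gap(u, v):
    """Return 2^{-3/2} ||u + v||^2 - ||UVe||; nonnegative when u^T v >= 0."""
    return 2.0 ** -1.5 * np.linalg.norm(u + v) ** 2 - np.linalg.norm(u * v)

rng = np.random.default_rng(0)
worst = np.inf
for _ in range(1000):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    if u @ v < 0:
        v = -v                      # flip v so the hypothesis u^T v >= 0 holds
    worst = min(worst, lemma_14_1_gap(u, v))
```

The variable `worst` records the smallest slack observed over the samples; the lemma says it can never be negative.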

For the next result, we omit the iteration counter $k$ from (14.10), and define the
diagonal matrices $\Delta X$ and $\Delta S$ similarly to (14.5), as follows:

\[ \Delta X = \mathrm{diag}(\Delta x_1, \Delta x_2, \ldots, \Delta x_n), \quad \Delta S = \mathrm{diag}(\Delta s_1, \Delta s_2, \ldots, \Delta s_n). \]

Lemma 14.2.

If $(x, \lambda, s) \in \mathcal{N}_{-\infty}(\gamma)$, then

\[ \|\Delta X \Delta S e\| \le 2^{-3/2} (1 + 1/\gamma) n \mu. \]

PROOF. It is easy to show using (14.10) that

\[ \Delta x^T \Delta s = 0. \tag{14.23} \]

By multiplying the last block row in (14.10) by $(XS)^{-1/2}$ and using the definition
$D = X^{1/2} S^{-1/2}$, we obtain

\[ D^{-1} \Delta x + D \Delta s = (XS)^{-1/2} (-XSe + \sigma\mu e). \tag{14.24} \]

Because $(D^{-1} \Delta x)^T (D \Delta s) = \Delta x^T \Delta s = 0$, we can apply Lemma 14.1 with $u = D^{-1} \Delta x$
and $v = D \Delta s$ to obtain

\begin{align*}
\|\Delta X \Delta S e\| &= \|(D^{-1} \Delta X)(D \Delta S) e\| \\
&\le 2^{-3/2} \|D^{-1} \Delta x + D \Delta s\|^2 && \text{from Lemma 14.1} \\
&= 2^{-3/2} \|(XS)^{-1/2} (-XSe + \sigma\mu e)\|^2 && \text{from (14.24).}
\end{align*}

Expanding the squared Euclidean norm and using such relationships as $x^T s = n\mu$ and
$e^T e = n$, we obtain

\begin{align*}
\|\Delta X \Delta S e\| &\le 2^{-3/2} \left[ x^T s - 2\sigma\mu e^T e + \sigma^2 \mu^2 \sum_{i=1}^n \frac{1}{x_i s_i} \right] \\
&\le 2^{-3/2} \left[ x^T s - 2\sigma\mu e^T e + \sigma^2 \mu^2 \frac{n}{\gamma\mu} \right] && \text{since } x_i s_i \ge \gamma\mu \\
&\le 2^{-3/2} \left[ 1 - 2\sigma + \frac{\sigma^2}{\gamma} \right] n\mu \\
&\le 2^{-3/2} (1 + 1/\gamma) n\mu,
\end{align*}

as claimed.

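Theorem 14.3.

Given the parameters $\gamma$, $\sigma_{\min}$, and $\sigma_{\max}$ in Algorithm 14.2, there is a constant $\delta$
independent of $n$ such that

\[ \mu_{k+1} \le \left( 1 - \frac{\delta}{n} \right) \mu_k, \tag{14.25} \]

for all $k \ge 0$.

PROOF. We start by proving that

\[ (x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \in \mathcal{N}_{-\infty}(\gamma) \quad \text{for all } \alpha \in \left[ 0, \; 2^{3/2} \gamma \frac{1 - \gamma}{1 + \gamma} \frac{\sigma_k}{n} \right], \tag{14.26} \]

where $(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha))$ is defined as in (14.19). It follows that the step length $\alpha_k$ is at
least as long as the upper bound of this interval, that is,

\[ \alpha_k \ge \frac{2^{3/2}}{n} \gamma \frac{1 - \gamma}{1 + \gamma} \sigma_k. \tag{14.27} \]

For any $i = 1, 2, \ldots, n$, we have from Lemma 14.2 that

\[ |\Delta x_i^k \Delta s_i^k| \le \|\Delta X^k \Delta S^k e\|_2 \le 2^{-3/2} (1 + 1/\gamma) n \mu_k. \tag{14.28} \]

Using (14.10), we have from $x_i^k s_i^k \ge \gamma\mu_k$ and (14.28) that

\begin{align*}
x_i^k(\alpha) s_i^k(\alpha) &= (x_i^k + \alpha \Delta x_i^k)(s_i^k + \alpha \Delta s_i^k) \\
&= x_i^k s_i^k + \alpha (x_i^k \Delta s_i^k + s_i^k \Delta x_i^k) + \alpha^2 \Delta x_i^k \Delta s_i^k \\
&\ge x_i^k s_i^k (1 - \alpha) + \alpha \sigma_k \mu_k - \alpha^2 |\Delta x_i^k \Delta s_i^k| \\
&\ge \gamma (1 - \alpha) \mu_k + \alpha \sigma_k \mu_k - \alpha^2 2^{-3/2} (1 + 1/\gamma) n \mu_k.
\end{align*}

By summing the $n$ components of the equation $S^k \Delta x^k + X^k \Delta s^k = -X^k S^k e + \sigma_k \mu_k e$ (the
third block row from (14.10)), and using (14.23) and the definition of $\mu_k$ and $\mu_k(\alpha)$ (see
(14.19)), we obtain

\[ \mu_k(\alpha) = (1 - \alpha(1 - \sigma_k)) \mu_k. \]

From these last two formulas, we can see that the proximity condition

\[ x_i^k(\alpha) s_i^k(\alpha) \ge \gamma \mu_k(\alpha) \]

is satisfied, provided that

\[ \gamma (1 - \alpha) \mu_k + \alpha \sigma_k \mu_k - \alpha^2 2^{-3/2} (1 + 1/\gamma) n \mu_k \ge \gamma (1 - \alpha + \alpha \sigma_k) \mu_k. \]

Rearranging this expression, we obtain

\[ \alpha \sigma_k \mu_k (1 - \gamma) \ge \alpha^2 2^{-3/2} n \mu_k (1 + 1/\gamma), \]

which is true if

\[ \alpha \le \frac{2^{3/2}}{n} \sigma_k \gamma \frac{1 - \gamma}{1 + \gamma}. \]

We have proved that $(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha))$ satisfies the proximity condition for $\mathcal{N}_{-\infty}(\gamma)$
when $\alpha$ lies in the range stated in (14.26). It is not difficult to show that
$(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \in \mathcal{F}^o$ for all $\alpha$ in the given range. Hence, we have proved (14.26)
and therefore (14.27).

We complete the proof of the theorem by estimating the reduction in $\mu$ on the $k$th
step. Because of (14.23), (14.27), and the last block row of (14.16), we have

\begin{align*}
\mu_{k+1} &= x^k(\alpha_k)^T s^k(\alpha_k) / n \\
&= \left[ (x^k)^T s^k + \alpha_k \left( (x^k)^T \Delta s^k + (s^k)^T \Delta x^k \right) + \alpha_k^2 (\Delta x^k)^T \Delta s^k \right] / n \\
&= \mu_k + \alpha_k \left( -(x^k)^T s^k / n + \sigma_k \mu_k \right) \\
&= (1 - \alpha_k (1 - \sigma_k)) \mu_k \\
&\le \left( 1 - \frac{2^{3/2}}{n} \gamma \frac{1 - \gamma}{1 + \gamma} \sigma_k (1 - \sigma_k) \right) \mu_k. \tag{14.29}
\end{align*}

Now, the function $\sigma(1 - \sigma)$ is a concave quadratic function of $\sigma$, so on any given interval
it attains its minimum value at one of the endpoints. Hence, we have

\[ \sigma_k (1 - \sigma_k) \ge \min \{ \sigma_{\min}(1 - \sigma_{\min}), \; \sigma_{\max}(1 - \sigma_{\max}) \}, \quad \text{for all } \sigma_k \in [\sigma_{\min}, \sigma_{\max}]. \]

The proof is completed by substituting this estimate into (14.29) and setting

\[ \delta = 2^{3/2} \gamma \frac{1 - \gamma}{1 + \gamma} \min \{ \sigma_{\min}(1 - \sigma_{\min}), \; \sigma_{\max}(1 - \sigma_{\max}) \}. \]

□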


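We conclude with a result that a reduction of a factor of $\epsilon$ in the duality measure $\mu$
can be obtained in $O(n \log 1/\epsilon)$ iterations.

Theorem 14.4.

Given $\epsilon \in (0, 1)$ and $\gamma \in (0, 1)$, suppose the starting point in Algorithm 14.2 satisfies
$(x^0, \lambda^0, s^0) \in \mathcal{N}_{-\infty}(\gamma)$. Then there is an index $K$ with $K = O(n \log 1/\epsilon)$ such that

\[ \mu_k \le \epsilon \mu_0, \quad \text{for all } k \ge K. \]

PROOF. By taking logarithms of both sides in (14.25), we obtain

\[ \log \mu_{k+1} \le \log \left( 1 - \frac{\delta}{n} \right) + \log \mu_k. \]

By applying this formula repeatedly, we have

\[ \log \mu_k \le k \log \left( 1 - \frac{\delta}{n} \right) + \log \mu_0. \]

The following well-known estimate for the log function,

\[ \log(1 + \beta) \le \beta \quad \text{for all } \beta > -1, \]

implies that

\[ \log(\mu_k / \mu_0) \le k \left( -\frac{\delta}{n} \right). \]

Therefore, the condition $\mu_k / \mu_0 \le \epsilon$ is satisfied if we have

\[ k \left( -\frac{\delta}{n} \right) \le \log \epsilon. \]

This inequality holds for all $k$ that satisfy

\[ k \ge K \stackrel{\rm def}{=} \frac{n}{\delta} \log \frac{1}{\epsilon} = \frac{n}{\delta} |\log \epsilon|, \]

so the proof is complete. □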
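To get a feel for how conservative this worst-case estimate is, the constant $\delta$ from Theorem 14.3 and the index $K$ from Theorem 14.4 can be evaluated for sample parameter values. The following sketch uses only the Python standard library; the function name and the sample values are illustrative assumptions.

```python
import math

def complexity_constants(n, eps, gamma, sigma_min, sigma_max):
    """Evaluate delta (Theorem 14.3) and the iteration bound K (Theorem 14.4)."""
    delta = (2 ** 1.5 * gamma * (1 - gamma) / (1 + gamma)
             * min(sigma_min * (1 - sigma_min), sigma_max * (1 - sigma_max)))
    K = math.ceil(n / delta * math.log(1 / eps))
    return delta, K

# Sample values (assumed for illustration only).
delta, K = complexity_constants(n=100, eps=1e-6, gamma=0.5,
                                sigma_min=0.1, sigma_max=0.9)
```

For these values $\delta$ is a few hundredths, so $K$ runs to tens of thousands of iterations, far more than the iteration counts observed in practice.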
14.2 PRACTICAL PRIMAL-DUAL ALGORITHMS


Practical  implementations  of  interior-point  algorithms  follow  the  spirit  of  the  previous 
section, in that strict positivity of $x^k$ and $s^k$ is maintained throughout and each step is a
Newton-like  step  involving  a  centering  component.  However,  most  implementations  work 
with  an  infeasible  starting  point  and  infeasible  iterations.  Several  aspects  of  “theoretical” 
algorithms  are  typically  ignored,  while  several  enhancements  are  added  that  have  a  significant 
effect  on  practical  performance.  In  this  section,  we  describe  the  algorithmic  enhancements 
that  are  found  in  a  typical  implementation  of  an  infeasible-interior-point  method,  and 
present  the  resulting  method  as  Algorithm  14.3.  Many  of  the  techniques  of  this  section  are 
described  in  the  paper  of  Mehrotra  [207],  which  can  be  consulted  for  further  details. 

CORRECTOR  AND  CENTERING  STEPS 

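A key feature of practical algorithms is their use of corrector steps that compensate for
the linearization error made by the Newton (affine-scaling) step in modeling the equation
$x_i s_i = 0$, $i = 1, 2, \ldots, n$ (see (14.3c)). Consider the affine-scaling direction
$(\Delta x^{\rm aff}, \Delta\lambda^{\rm aff}, \Delta s^{\rm aff})$ defined by

\[ \begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x^{\rm aff} \\ \Delta\lambda^{\rm aff} \\ \Delta s^{\rm aff} \end{bmatrix} = \begin{bmatrix} -r_c \\ -r_b \\ -XSe \end{bmatrix} \tag{14.30} \]

(where $r_b$ and $r_c$ are defined in (14.7)). If we take a full step in this direction, we obtain

\[ (x_i + \Delta x_i^{\rm aff})(s_i + \Delta s_i^{\rm aff}) = x_i s_i + x_i \Delta s_i^{\rm aff} + s_i \Delta x_i^{\rm aff} + \Delta x_i^{\rm aff} \Delta s_i^{\rm aff} = \Delta x_i^{\rm aff} \Delta s_i^{\rm aff}. \]

That is, the updated value of $x_i s_i$ is $\Delta x_i^{\rm aff} \Delta s_i^{\rm aff}$ rather than the ideal value 0. We can solve
the following system to obtain a step $(\Delta x^{\rm cor}, \Delta\lambda^{\rm cor}, \Delta s^{\rm cor})$ that attempts to correct for this
deviation from the ideal:

\[ \begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x^{\rm cor} \\ \Delta\lambda^{\rm cor} \\ \Delta s^{\rm cor} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ -\Delta X^{\rm aff} \Delta S^{\rm aff} e \end{bmatrix}. \tag{14.31} \]

In many cases, the combined step $(\Delta x^{\rm aff}, \Delta\lambda^{\rm aff}, \Delta s^{\rm aff}) + (\Delta x^{\rm cor}, \Delta\lambda^{\rm cor}, \Delta s^{\rm cor})$ does a better
job of reducing the duality measure than does the affine-scaling step alone.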

Like theoretical algorithms such as the one analyzed in Section 14.1, practical algorithms
make use of centering steps, with an adaptive choice of the centering parameter $\sigma_k$.
The affine-scaling step can be used as the basis of a successful heuristic for choosing $\sigma_k$.
Roughly speaking, if the affine-scaling step (multiplied by a steplength to maintain
nonnegativity of $x$ and $s$) reduces the duality measure significantly, there is not much need for
centering, so a smaller value of $\sigma_k$ is appropriate. Conversely, if not much progress can be
made along this direction before reaching the boundary of the nonnegative orthant, a larger
value of $\sigma_k$ will ensure that the next iterate is more centered, so a longer step will be possible
from this next point. Specifically, this scheme calculates the maximum allowable steplengths
along the affine-scaling direction (14.30) as follows:


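\[ \alpha_{\rm aff}^{\rm pri} \stackrel{\rm def}{=} \min \left( 1, \; \min_{i : \Delta x_i^{\rm aff} < 0} -\frac{x_i}{\Delta x_i^{\rm aff}} \right), \tag{14.32a} \]

\[ \alpha_{\rm aff}^{\rm dual} \stackrel{\rm def}{=} \min \left( 1, \; \min_{i : \Delta s_i^{\rm aff} < 0} -\frac{s_i}{\Delta s_i^{\rm aff}} \right), \tag{14.32b} \]

and then defines $\mu_{\rm aff}$ to be the value of $\mu$ that would be obtained by using these steplengths,
that is,

\[ \mu_{\rm aff} = (x + \alpha_{\rm aff}^{\rm pri} \Delta x^{\rm aff})^T (s + \alpha_{\rm aff}^{\rm dual} \Delta s^{\rm aff}) / n. \tag{14.33} \]

The centering parameter $\sigma$ is chosen according to the following heuristic (which does not
have a solid analytical justification, but appears to work well in practice):

\[ \sigma = \left( \frac{\mu_{\rm aff}}{\mu} \right)^3. \tag{14.34} \]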


To summarize, computation of the search direction requires the solution of two linear
systems. First, the system (14.30) is solved to obtain the affine-scaling direction, also known
as the predictor step. This step is used to define the right-hand side for the corrector step (see
(14.31)) and to calculate the centering parameter from (14.33), (14.34). Second, the search
direction is calculated by solving

\[ \begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} -r_c \\ -r_b \\ -XSe - \Delta X^{\rm aff} \Delta S^{\rm aff} e + \sigma\mu e \end{bmatrix}. \tag{14.35} \]

Note that the predictor, corrector, and centering contributions have been aggregated on the
right-hand side of this system. The coefficient matrix in both linear systems (14.30) and
(14.35) is the same. Thus, the factorization of the matrix needs to be computed only once,
and the marginal cost of solving the second system is relatively small.

STEP LENGTHS

Practical implementations typically do not enforce membership of the central path
neighborhoods $\mathcal{N}_2$ and $\mathcal{N}_{-\infty}$ defined in the previous section. Rather, they calculate the
maximum steplengths that can be taken in the $x$ and $s$ variables (separately) without
violating nonnegativity, then take a steplength of slightly less than this maximum (but no
greater than 1). Given an iterate $(x^k, \lambda^k, s^k)$ with $(x^k, s^k) > 0$, and a step $(\Delta x^k, \Delta\lambda^k, \Delta s^k)$,
it is easy to show that the quantities $\alpha_{k,\max}^{\rm pri}$ and $\alpha_{k,\max}^{\rm dual}$ defined as follows:


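\[ \alpha_{k,\max}^{\rm pri} \stackrel{\rm def}{=} \min_{i : \Delta x_i^k < 0} -\frac{x_i^k}{\Delta x_i^k}, \qquad \alpha_{k,\max}^{\rm dual} \stackrel{\rm def}{=} \min_{i : \Delta s_i^k < 0} -\frac{s_i^k}{\Delta s_i^k}, \tag{14.36} \]

are the largest values of $\alpha$ for which $x^k + \alpha \Delta x^k \ge 0$ and $s^k + \alpha \Delta s^k \ge 0$, respectively. (Note
that these formulae are similar to the ratio test used in the simplex method to determine
the index that leaves the basis.) Practical algorithms then choose the steplengths to lie in the
open intervals defined by these maxima, that is,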


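\[ \alpha_k^{\rm pri} \in (0, \alpha_{k,\max}^{\rm pri}), \qquad \alpha_k^{\rm dual} \in (0, \alpha_{k,\max}^{\rm dual}), \]

and then obtain a new iterate by setting

\[ x^{k+1} = x^k + \alpha_k^{\rm pri} \Delta x^k, \qquad (\lambda^{k+1}, s^{k+1}) = (\lambda^k, s^k) + \alpha_k^{\rm dual} (\Delta\lambda^k, \Delta s^k). \]

If the step $(\Delta x^k, \Delta\lambda^k, \Delta s^k)$ rectifies the infeasibility in the KKT conditions (14.3a) and
(14.3b), that is,

\[ A \Delta x^k = -r_b^k = -(A x^k - b), \qquad A^T \Delta\lambda^k + \Delta s^k = -r_c^k = -(A^T \lambda^k + s^k - c), \]

it is easy to show that the infeasibilities at the new iterate satisfy

\[ r_b^{k+1} = (1 - \alpha_k^{\rm pri}) r_b^k, \qquad r_c^{k+1} = (1 - \alpha_k^{\rm dual}) r_c^k. \tag{14.37} \]

The following formula is used to calculate steplengths in many practical
implementations:

\[ \alpha_k^{\rm pri} = \min(1, \eta_k \alpha_{k,\max}^{\rm pri}), \qquad \alpha_k^{\rm dual} = \min(1, \eta_k \alpha_{k,\max}^{\rm dual}), \tag{14.38} \]

where $\eta_k \in [0.9, 1.0)$ is chosen so that $\eta_k \to 1$ as the iterates approach the primal-dual
solution, to accelerate the asymptotic convergence.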
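The ratio tests (14.36) and the damped steplengths (14.38) translate directly into a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, and $\eta_k$ is passed as a fixed argument `eta` rather than being driven toward 1.

```python
import numpy as np

def max_step(v, dv):
    """Largest alpha with v + alpha * dv >= 0 (infinite if dv >= 0)."""
    neg = dv < 0
    return float(np.min(-v[neg] / dv[neg])) if np.any(neg) else np.inf

def step_lengths(x, dx, s, ds, eta=0.99):
    """Primal and dual steplengths: min(1, eta * maximum step to the boundary)."""
    a_pri = min(1.0, eta * max_step(x, dx))
    a_dual = min(1.0, eta * max_step(s, ds))
    return a_pri, a_dual
```

For instance, with `x = [1, 2]` and `dx = [-2, 1]` the boundary is reached at $\alpha = 0.5$, so the primal steplength is `0.495` when `eta = 0.99`; when no component of the step is negative, the full unit step is taken.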

STARTING POINT

Choice of starting point is an important practical issue with a significant effect on
the robustness of the algorithm. A poor choice $(x^0, \lambda^0, s^0)$ satisfying only the minimal
conditions $x^0 > 0$ and $s^0 > 0$ often leads to failure of convergence. We describe here a
heuristic that finds a starting point that satisfies the equality constraints in the primal and
dual problems reasonably well, while maintaining positivity of the $x$ and $s$ components and
avoiding excessively large values of these components.

First, we find a vector $\tilde{x}$ of minimum norm satisfying the primal constraint $Ax = b$,
and a vector $(\tilde\lambda, \tilde{s})$ satisfying the dual constraint $A^T \lambda + s = c$ such that $\tilde{s}$ has minimum norm.
That is, we solve the problems

\[ \min_x \tfrac{1}{2} x^T x \quad \text{subject to } Ax = b, \tag{14.39a} \]

\[ \min_{(\lambda, s)} \tfrac{1}{2} s^T s \quad \text{subject to } A^T \lambda + s = c. \tag{14.39b} \]

It is not difficult to show that $\tilde{x}$ and $(\tilde\lambda, \tilde{s})$ can be written explicitly as follows:

\[ \tilde{x} = A^T (A A^T)^{-1} b, \qquad \tilde\lambda = (A A^T)^{-1} A c, \qquad \tilde{s} = c - A^T \tilde\lambda. \tag{14.40} \]

In general, $\tilde{x}$ and $\tilde{s}$ will have nonpositive components, so are not suitable for use as a
starting point. We define

\[ \delta_x = \max(-(3/2) \min_i \tilde{x}_i, \, 0), \qquad \delta_s = \max(-(3/2) \min_i \tilde{s}_i, \, 0), \]

and adjust the $x$ and $s$ vectors as follows:

\[ \hat{x} = \tilde{x} + \delta_x e, \qquad \hat{s} = \tilde{s} + \delta_s e, \]

where, as usual, $e = (1, 1, \ldots, 1)^T$. Clearly, we have $\hat{x} \ge 0$ and $\hat{s} \ge 0$. To ensure that the
components of $x^0$ and $s^0$ are not too close to zero and not too dissimilar, we add two more
scalars defined as follows:

\[ \hat\delta_x = \frac{1}{2} \frac{\hat{x}^T \hat{s}}{e^T \hat{s}}, \qquad \hat\delta_s = \frac{1}{2} \frac{\hat{x}^T \hat{s}}{e^T \hat{x}}. \]

(Note that $\hat\delta_x$ is the average size of the components of $\hat{x}$, weighted by the corresponding
components of $\hat{s}$; similarly for $\hat\delta_s$.) Finally, we define the starting point as follows:

\[ x^0 = \hat{x} + \hat\delta_x e, \qquad \lambda^0 = \tilde\lambda, \qquad s^0 = \hat{s} + \hat\delta_s e. \]

The computational cost of finding $(x^0, \lambda^0, s^0)$ by this scheme is about the same as one step
of the primal-dual method.
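The starting-point heuristic can be written out as follows. This is a dense NumPy sketch under the assumption that $A$ has full row rank (so that $AA^T$ is nonsingular) and that the shifted vectors are not identically zero; the function name is hypothetical, and a practical code would reuse a sparse factorization of $AA^T$ rather than calling a dense solver twice.

```python
import numpy as np

def starting_point(A, b, c):
    """Heuristic starting point built from least-norm solutions plus shifts."""
    AAt = A @ A.T
    x = A.T @ np.linalg.solve(AAt, b)      # least-norm solution of Ax = b
    lam = np.linalg.solve(AAt, A @ c)      # dual multipliers with least-norm s
    s = c - A.T @ lam
    # First shift: make both vectors nonnegative.
    x = x + max(-1.5 * x.min(), 0.0)
    s = s + max(-1.5 * s.min(), 0.0)
    # Second shift: move away from zero and balance the two vectors.
    xs = x @ s
    x0 = x + 0.5 * xs / s.sum()
    s0 = s + 0.5 * xs / x.sum()
    return x0, lam, s0
```

The returned point is strictly positive in $x$ and $s$ but generally infeasible: the shifts that restore positivity also perturb the residuals $Ax - b$ and $A^T\lambda + s - c$.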

In some cases, we have prior knowledge about the solution, possibly in the form
of  a  solution  to  a  similar  linear  program.  The  use  of  such  “warm-start”  information  in 
constructing  a  starting  point  is  discussed  in  Section  14.4. 

A  PRACTICAL  ALGORITHM 

We  now  give  a  formal  specification  of  a  practical  algorithm. 

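Algorithm 14.3 (Predictor-Corrector Algorithm (Mehrotra [207])).

Calculate $(x^0, \lambda^0, s^0)$ as described above;

for $k = 0, 1, 2, \ldots$

    Set $(x, \lambda, s) = (x^k, \lambda^k, s^k)$ and solve (14.30) for $(\Delta x^{\rm aff}, \Delta\lambda^{\rm aff}, \Delta s^{\rm aff})$;

    Calculate $\alpha_{\rm aff}^{\rm pri}$, $\alpha_{\rm aff}^{\rm dual}$, and $\mu_{\rm aff}$ as in (14.32) and (14.33);

    Set centering parameter to $\sigma = (\mu_{\rm aff}/\mu)^3$;

    Solve (14.35) for $(\Delta x, \Delta\lambda, \Delta s)$;

    Calculate $\alpha_k^{\rm pri}$ and $\alpha_k^{\rm dual}$ from (14.38);

    Set

        $x^{k+1} = x^k + \alpha_k^{\rm pri} \Delta x$,

        $(\lambda^{k+1}, s^{k+1}) = (\lambda^k, s^k) + \alpha_k^{\rm dual} (\Delta\lambda, \Delta s)$;

end (for).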
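A compact dense sketch of Algorithm 14.3 follows. It is meant only to show how the pieces fit together, under several assumptions that are not part of the algorithm statement: the function names are hypothetical, both linear systems are solved by dense block elimination to the normal equations, $\eta_k$ is fixed at 0.99, the caller supplies any strictly positive starting point, and a simple stopping test on the residuals and duality measure is added.

```python
import numpy as np

def step_to_boundary(v, dv):
    """Largest alpha with v + alpha * dv >= 0 (infinite if dv >= 0)."""
    neg = dv < 0
    return np.min(-v[neg] / dv[neg]) if np.any(neg) else np.inf

def solve_step(A, x, s, r_b, r_c, r_xs):
    """Solve the step equations with right-hand side (-r_c, -r_b, -r_xs)."""
    d2 = x / s
    rhs = -r_b - A @ (d2 * r_c) + A @ (r_xs / s)
    dlam = np.linalg.solve((A * d2) @ A.T, rhs)
    ds = -r_c - A.T @ dlam
    dx = -r_xs / s - d2 * ds
    return dx, dlam, ds

def mehrotra(A, b, c, x, lam, s, tol=1e-8, max_iter=50, eta=0.99):
    """Predictor-corrector sketch for min c^T x s.t. Ax = b, x >= 0."""
    n = x.size
    for _ in range(max_iter):
        r_b, r_c = A @ x - b, A.T @ lam + s - c
        mu = x @ s / n
        if max(np.linalg.norm(r_b), np.linalg.norm(r_c), mu) < tol:
            break
        # Predictor (affine-scaling) step and centering heuristic.
        dx_a, _, ds_a = solve_step(A, x, s, r_b, r_c, x * s)
        a_pri = min(1.0, step_to_boundary(x, dx_a))
        a_dual = min(1.0, step_to_boundary(s, ds_a))
        mu_aff = (x + a_pri * dx_a) @ (s + a_dual * ds_a) / n
        sigma = (mu_aff / mu) ** 3
        # Combined predictor, corrector, and centering step.
        dx, dlam, ds = solve_step(A, x, s, r_b, r_c,
                                  x * s + dx_a * ds_a - sigma * mu)
        a_pri = min(1.0, eta * step_to_boundary(x, dx))
        a_dual = min(1.0, eta * step_to_boundary(s, ds))
        x = x + a_pri * dx
        lam, s = lam + a_dual * dlam, s + a_dual * ds
    return x, lam, s
```

Both calls to `solve_step` use the same coefficient matrix, so a real implementation would factor it once per iteration and reuse the factors; the sketch re-solves from scratch only to stay short.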

No  convergence  theory  is  available  for  Mehrotra’s  algorithm,  at  least  in  the  form  in 
which  it  is  described  above.  In  fact,  there  are  examples  for  which  the  algorithm  diverges. 
Simple  safeguards  could  be  incorporated  into  the  method  to  force  it  into  the  convergence 
framework  of  existing  methods  or  to  improve  its  robustness,  but  many  practical  codes  do 
not  implement  these  safeguards,  because  failures  are  rare. 

When  presented  with  a  linear  program  that  is  infeasible  or  unbounded,  the  algorithm 
above typically diverges, with the infeasibilities $r_b^k$ and $r_c^k$ and/or the duality measure $\mu_k$
going to $\infty$. Since the symptoms of infeasibility and unboundedness are fairly easy to
recognize,  interior-point  codes  contain  heuristics  to  detect  and  report  these  conditions. 
More  rigorous  approaches  for  detecting  infeasibility  and  unboundedness  make  use  of  the 
homogeneous  self-dual  formulation;  see  Wright  [316,  Chapter  9]  and  the  references  therein 
for  a  discussion.  A  more  recent  approach  that  applies  directly  to  infeasible-interior-point 
methods  is  described  by  Todd  [286]. 

SOLVING  THE  LINEAR  SYSTEMS 

Most  of  the  computational  effort  in  primal-dual  methods  is  taken  up  in  solving  linear 
systems  such  as  (14.9),  (14.30),  and  (14.35).  The  coefficient  matrix  in  these  systems  is  usually 
large  and  sparse,  since  the  constraint  matrix  A  is  itself  large  and  sparse  in  most  applications. 
The special structure in the step equations allows us to reformulate them as systems with
more  compact  symmetric  coefficient  matrices,  which  are  easier  and  cheaper  to  factor  than 
the  original  sparse  form. 

We apply the reformulation procedures to the following general form of the linear
system:

\[ \begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} -r_c \\ -r_b \\ -r_{xs} \end{bmatrix}. \tag{14.41} \]

Since $x$ and $s$ are strictly positive, the diagonal matrices $X$ and $S$ are nonsingular. We can
eliminate $\Delta s$ and add $-X^{-1}$ times the third equation in this system to the first equation to
obtain

\[ \begin{bmatrix} -D^{-2} & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \end{bmatrix} = \begin{bmatrix} -r_c + X^{-1} r_{xs} \\ -r_b \end{bmatrix}, \tag{14.42a} \]

\[ \Delta s = -X^{-1} r_{xs} - X^{-1} S \Delta x, \tag{14.42b} \]

where we have introduced the notation

\[ D = S^{-1/2} X^{1/2}. \tag{14.43} \]

This form of the step equations usually is known as the augmented system. We can go further
and eliminate $\Delta x$ and add $AD^2$ times the first equation to the second equation in (14.42a)
to obtain

\[ A D^2 A^T \Delta\lambda = -r_b - A X S^{-1} r_c + A S^{-1} r_{xs}, \tag{14.44a} \]

\[ \Delta s = -r_c - A^T \Delta\lambda, \tag{14.44b} \]

\[ \Delta x = -S^{-1} r_{xs} - X S^{-1} \Delta s, \tag{14.44c} \]

where the expressions for $\Delta s$ and $\Delta x$ are obtained from the original system (14.41). The
form (14.44a) often is called the normal-equations form, because the system (14.44a) can
be viewed as the normal equations (10.14) for a certain linear least-squares problem with
coefficient matrix $DA^T$.
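The equivalence between the full system (14.41) and the normal-equations form (14.44) can be checked numerically on a small dense example. The sketch below is illustrative: the function names are hypothetical, the matrices are formed densely (real codes use sparse Cholesky software), and `numpy.linalg.cholesky` is assumed to succeed, which requires $A$ to have full row rank.

```python
import numpy as np

def solve_normal_equations(A, x, s, r_b, r_c, r_xs):
    """Solve the step equations via A D^2 A^T and a Cholesky factorization."""
    d2 = x / s                                   # diagonal of D^2 = X S^{-1}
    L = np.linalg.cholesky((A * d2) @ A.T)
    rhs = -r_b - A @ (d2 * r_c) + A @ (r_xs / s)
    dlam = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
    ds = -r_c - A.T @ dlam
    dx = -r_xs / s - d2 * ds
    return dx, dlam, ds

def solve_full_system(A, x, s, r_b, r_c, r_xs):
    """Solve the unreduced 3-by-3 block system directly, for comparison."""
    m, n = A.shape
    K = np.block([
        [np.zeros((n, n)), A.T, np.eye(n)],
        [A, np.zeros((m, m)), np.zeros((m, n))],
        [np.diag(s), np.zeros((n, m)), np.diag(x)],
    ])
    sol = np.linalg.solve(K, -np.concatenate([r_c, r_b, r_xs]))
    return sol[:n], sol[n:n + m], sol[n + m:]
```

On well-conditioned data the two routines agree to rounding error, while the reduced system has only $m$ unknowns instead of $2n + m$.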

Most implementations of primal-dual methods are based on formulations like (14.44).
They use direct sparse Cholesky algorithms to factor the matrix $AD^2A^T$, and then perform
triangular solves with the resulting sparse factors to obtain the step $\Delta\lambda$ from (14.44a).
The steps $\Delta s$ and $\Delta x$ are recovered from (14.44b) and (14.44c). General-purpose sparse
Cholesky software can be applied to $AD^2A^T$, but modifications are needed because $AD^2A^T$
may be ill-conditioned or singular. (Ill conditioning of this system is often observed during
the final stages of a primal-dual algorithm, when the elements of the diagonal weighting
matrix $D^2$ take on both huge and tiny values.) The Cholesky technique may encounter
diagonal elements that are very small, zero or (because of roundoff error) slightly negative.
One approach for handling this eventuality is to skip a step of the factorization, setting
the component of $\Delta\lambda$ that corresponds to the faulty diagonal element to zero. We refer to
Wright [317] for details of this and other approaches.

A disadvantage of the normal-equations formulation is that if $A$ contains any dense
columns, the entire matrix $AD^2A^T$ is also dense. Hence, practical software identifies
dense and nearly-dense columns, excludes them from the matrix product $AD^2A^T$, and
performs the Cholesky factorization of the resulting sparse matrix. Then, a device such as a
Sherman-Morrison-Woodbury update is applied to account for the excluded columns. We
refer the reader to Wright [316, Chapter 11] for further details.

The formulation (14.42) has received less attention than (14.44), mainly because
algorithms and software for factoring sparse symmetric indefinite matrices are more complicated,
slower, and less prevalent than sparse Cholesky algorithms. Nevertheless, the
formulation (14.42) is cleaner and more flexible than (14.44) in a number of respects.
It normally avoids the fill-in caused by dense columns in $A$ in the matrix product $AD^2A^T$.
Moreover, it allows free variables (components of $x$ with no explicit lower or upper bounds)
to be handled directly in the formulation. (The normal equations form must resort to various
artificial devices to express such variables, otherwise it is not possible to perform the
block elimination that leads to the system (14.44a).)


14.3 OTHER PRIMAL-DUAL ALGORITHMS AND EXTENSIONS

OTHER  PATH-FOLLOWING  METHODS 

Framework  14.1  is  the  basis  of  a  number  of  other  algorithms  of  the  path-following 
variety.  They  are  less  important  from  a  practical  viewpoint,  but  we  mention  them  here 
because  of  their  elegance  and  their  strong  theoretical  properties. 

Some path-following methods choose conservative values for the centering parameter
$\sigma$ (that is, $\sigma$ only slightly less than 1) so that unit steps (that is, a steplength of $\alpha = 1$) can be
taken along the resulting direction from (14.16) without leaving the chosen neighborhood.
These methods, which are known as short-step path-following methods, make only slow
progress toward the solution because they require the iterates to stay inside a restrictive $\mathcal{N}_2$
neighborhood (14.17). From a theoretical point of view, however, they have the advantage
of better complexity. (A result similar to Theorem 14.4 holds with $n$ replaced by $n^{1/2}$ in the
complexity estimate.)

Better results are obtained with the predictor-corrector method, due to Mizuno, Todd,
and Ye [208], which uses two $\mathcal{N}_2$ neighborhoods, nested one inside the other. (Despite the
similar terminology, this algorithm is quite distinct from Algorithm 14.3 of Section 14.2.)
Every second step of this method is a predictor step, which starts in the inner neighborhood
and moves along the affine-scaling direction (computed by setting $\sigma = 0$ in (14.16)) to
the boundary of the outer neighborhood. The gap between neighborhood boundaries is
wide enough to allow this step to make significant progress in reducing $\mu$. Alternating with
the predictor steps are corrector steps (computed with $\sigma = 1$ and $\alpha = 1$), which take
the next iterate back inside the inner neighborhood in preparation for the next predictor
step. The predictor-corrector algorithm produces a sequence of duality measures $\mu_k$ that
converge superlinearly to zero, in contrast to the linear convergence that characterizes most
methods.


POTENTIAL-REDUCTION  METHODS 

Potential-reduction methods take steps of the same form as path-following methods,
but they do not explicitly follow the central path $\mathcal{C}$ and can be motivated independently of
it. They use a logarithmic potential function to measure the worth of each point in $\mathcal{F}^o$ and
aim to achieve a certain fixed reduction in this function at each iteration. The primal-dual
potential function, which we denote generically by $\Phi$, usually has two important properties:

\[ \Phi \to \infty \quad \text{if } x_i s_i \to 0 \text{ for some } i, \text{ while } \mu = x^T s / n \not\to 0, \tag{14.45a} \]

\[ \Phi \to -\infty \quad \text{if and only if } (x, \lambda, s) \to \Omega. \tag{14.45b} \]

The first property (14.45a) prevents any one of the pairwise products $x_i s_i$ from approaching
zero independently of the others, and therefore keeps the iterates away from the boundary
of the nonnegative orthant. The second property (14.45b) relates $\Phi$ to the solution set $\Omega$.
If our algorithm forces $\Phi$ to $-\infty$, then (14.45b) ensures that the sequence approaches the
solution set.

An interesting primal-dual potential function is defined by

\[ \Phi_\rho(x, s) = \rho \log x^T s - \sum_{i=1}^n \log x_i s_i, \tag{14.46} \]

for some parameter $\rho > n$ (see Tanabe [283] and Todd and Ye [287]). Like all algorithms
based on Framework 14.1, potential-reduction algorithms obtain their search directions by
solving (14.10), for some $\sigma_k \in (0, 1)$, and they take steps of length $\alpha_k$ along these directions.
For instance, the step length $\alpha_k$ may be chosen to approximately minimize $\Phi_\rho$ along the
computed direction. By fixing $\sigma_k = n/(n + \sqrt{n})$ for all $k$, one can guarantee constant
reduction in $\Phi_\rho$ at every iteration. Hence, $\Phi_\rho$ will approach $-\infty$, forcing convergence.
Adaptive and heuristic choices of $\sigma_k$ and $\alpha_k$ are also covered by the theory, provided that
they at least match the reduction in $\Phi_\rho$ obtained from the conservative theoretical values of
these parameters.
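The potential function (14.46) is cheap to evaluate, and a short numerical experiment shows the barrier behavior described in (14.45a). The following NumPy sketch is illustrative; the function name and the test vectors are assumptions made here.

```python
import numpy as np

def potential(x, s, rho):
    """Evaluate rho * log(x^T s) - sum_i log(x_i s_i) for x, s > 0."""
    return rho * np.log(x @ s) - np.sum(np.log(x * s))

n = 4
rho = n + np.sqrt(n)                       # any value rho > n is admissible
centered = potential(np.ones(n), np.ones(n), rho)
# Shrink one pairwise product toward zero while the others stay fixed.
skewed = potential(np.array([1e-8, 1.0, 1.0, 1.0]), np.ones(n), rho)
```

Here `skewed` is much larger than `centered`: driving a single product $x_i s_i$ toward zero while $\mu$ stays bounded away from zero makes the potential blow up, which is why reducing the potential keeps iterates away from the boundary.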

EXTENSIONS

Primal-dual methods for linear programming can be extended to wider classes of
problems. There are simple extensions of the algorithm to the monotone linear complementarity
problem (LCP) and convex quadratic programming problems for which the
convergence and polynomial complexity properties of the linear programming algorithms
are retained. The monotone LCP is the problem of finding vectors $x$ and $s$ in $\mathbb{R}^n$ that satisfy
the following conditions:

\[ s = Mx + q, \quad (x, s) \ge 0, \quad x^T s = 0, \tag{14.47} \]

where $M$ is a positive semidefinite $n \times n$ matrix and $q \in \mathbb{R}^n$. The similarity between (14.47)
and the KKT conditions (14.3) is obvious: The last two conditions in (14.47) correspond to
(14.3d) and (14.3c), respectively, while the condition $s = Mx + q$ is similar to the equations
(14.3a) and (14.3b). For practical instances of the problem (14.47), see Cottle, Pang, and
Stone [80]. Interior-point methods for monotone LCP have a close correspondence to
algorithms for linear programming. The duality measure (14.6) is redefined to be the
complementarity measure (with the same definition $\mu = x^T s / n$), and the conditions that
must be satisfied by the solution can be stated similarly to (14.4) as follows:

\[ \begin{bmatrix} Mx + q - s \\ XSe \end{bmatrix} = 0, \quad (x, s) \ge 0. \]

The general formula for a path-following step is defined analogously to (14.9) as follows:

\[ \begin{bmatrix} M & -I \\ S & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta s \end{bmatrix} = \begin{bmatrix} -(Mx + q - s) \\ -XSe + \sigma\mu e \end{bmatrix}, \]

where $\sigma \in [0, 1]$. Using these and similar adaptations, an extension of the practical method
of Section 14.2 can also be derived.
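The path-following step for the monotone LCP can be computed by eliminating $\Delta s$ from the first block row and solving an $n \times n$ system for $\Delta x$. The dense NumPy sketch below is an illustration under that elimination; the function name is hypothetical, and no claim is made that practical LCP codes organize the linear algebra this way.

```python
import numpy as np

def lcp_step(M, q, x, s, sigma):
    """Path-following step for s = Mx + q, (x, s) >= 0, x^T s = 0."""
    mu = x @ s / x.size
    r = M @ x + q - s                          # residual of the linear equation
    # From M dx - ds = -r we get ds = M dx + r; substituting into
    # S dx + X ds = -XSe + sigma*mu*e gives (S + X M) dx = -XSe + sigma*mu*e - X r.
    lhs = np.diag(s) + x[:, None] * M
    dx = np.linalg.solve(lhs, -x * s + sigma * mu - x * r)
    ds = M @ dx + r
    return dx, ds
```

For positive semidefinite $M$ and strictly positive $x$ and $s$, the reduced matrix $S + XM$ is nonsingular, so the step is well defined.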

Extensions  to  convex  quadratic  programs  are  discussed  in  Section  16.6.  Their 
adaptation  to  nonlinear  programming  problems  is  the  subject  of  Chapter  19. 

Interior-point methods are highly effective in solving semidefinite programming problems,
a class of problems involving symmetric matrix variables that are constrained to be
positive semidefinite. Semidefinite programming, which has been the topic of concentrated
research since the early 1990s, has applications in many areas, including control theory and
combinatorial optimization. Further information on this increasingly important topic can
be found in the survey papers of Todd [285] and Vandenberghe and Boyd [292] and the
books of Nesterov and Nemirovskii [226], Boyd et al. [37], and Boyd and Vandenberghe [38].
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1 4.4  PERSPECTIVES  AND  SOFTWARE 


The  appearance  of  interior-point  methods  in  the  1980s  presented  the  first  serious  challenge 
to  the  dominance  of  the  simplex  method  as  a  practical  means  of  solving  linear  programming 
problems. By about 1990, interior-point codes had emerged that incorporated the techniques
described in Section 14.2 and that were superior on many large problems to the simplex
codes available at that time. The years that followed saw significant improvements in simplex
software, evidenced by the appearance of packages such as CPLEX and XPRESS-MP. These
improvements were due to algorithmic advances such as steepest-edge pivoting (see Goldfarb
and  Forrest  [133])  and  improved  pricing  heuristics,  and  also  to  close  attention  to  the  nuts 
and  bolts  of  efficient  implementation.  The  efficiency  of  interior-point  codes  also  continued 
to  improve,  through  improvements  in  the  linear  algebra  for  solving  the  step  equations  and 
through  the  use  of  higher-order  correctors  in  the  step  calculation  (see  Gondzio  [138]). 
During  this  period,  a  number  of  good  interior-point  codes  became  freely  available  (such 
as  PCx  [84],  HOPDM  [137],  BPMPD,  and  LIPSOL  [321])  and  found  their  way  into  many 
applications. 

In  general,  simplex  codes  are  faster  on  problems  of  small-medium  dimensions,  while 
interior-point  codes  are  competitive  and  often  faster  on  large  problems.  However,  this 
rule  is  certainly  not  hard-and-fast;  it  depends  strongly  on  the  structure  of  the  particular 
application.  Interior-point  methods  are  generally  not  able  to  take  full  advantage  of  prior 
knowledge  about  the  solution,  such  as  an  estimate  of  the  solution  itself  or  an  estimate  of 
the  optimal  basis.  Hence,  interior-point  methods  are  less  useful  than  simplex  approaches 
in  situations  in  which  “warm-start”  information  is  readily  available.  One  situation  of  this 
type  involves  branch-and-bound  algorithms  for  solving  integer  programs,  where  each  node 
in  the  branch-and-bound  tree  requires  the  solution  of  a  linear  program  that  differs  only 
slightly  from  one  already  solved  in  the  parent  node.  In  other  situations,  we  may  wish  to 
solve  a  sequence  of  linear  programs  in  which  the  data  is  perturbed  slightly  to  investigate 
sensitivity of the solutions to various perturbations, or in which we approximate a nonlinear
optimization problem by a sequence of linear programs. Yildirim and Wright [319]
describe  how  a  given  point  (such  as  an  approximate  solution)  can  be  modified  to  obtain 
a  starting  point  that  is  theoretically  valid,  in  that  it  allows  complexity  results  to  be  proved 
that  depend  on  the  quality  of  the  given  point.  In  practice,  however,  these  techniques  can 
be  expected  to  provide  only  a  modest  improvement  in  algorithmic  performance  (perhaps 
a  factor  of  between  2  and  5)  over  a  “cold”  starting  point  such  as  the  one  described  in 
Section  14.2. 

Interior-point  software  has  the  advantage  that  it  is  easy  to  program,  relative  to  the 
simplex  method.  The  most  complex  operation  is  the  solution  of  the  large  linear  systems 
at  each  iteration  to  compute  the  step;  software  to  perform  this  linear  algebra  operation  is 
readily  available.  The  interior-point  code  LIPSOL  [321]  is  written  entirely  in  the  Matlab 
language,  apart  from  a  small  amount  of  FORTRAN  code  that  interfaces  to  the  linear  algebra 
software.  The  code  PCx  [84]  is  written  in  C,  but  also  is  easy  for  the  interested  user  to 
comprehend and modify. It is even possible for a non-expert in optimization to write an
efficient interior-point implementation from scratch that is customized to their particular
application. 


NOTES  AND  REFERENCES 

For more details on the material of this chapter, see the book by Wright [316].

As  noted  in  the  text,  Karmarkar’s  method  arose  from  a  search  for  linear  programming 
algorithms  with  better  worst-case  behavior  than  the  simplex  method.  The  first  algorithm 
with  polynomial  complexity,  Khachiyan’s  ellipsoid  algorithm  [180],  was  a  computational 
disappointment.  In  contrast,  the  execution  times  required  by  Karmarkar’s  method  were 
not  too  much  greater  than  simplex  codes  at  the  time  of  its  introduction,  particularly  for 
large linear programs. Karmarkar’s is a primal algorithm; that is, it is described, motivated,
and implemented purely in terms of the primal problem (14.1) without reference
to the dual. At each iteration, Karmarkar’s algorithm performs a projective transformation
on the primal feasible set that maps the current iterate $x^k$ to the center of the
set and takes a step in the feasible steepest descent direction for the transformed space.
Progress toward optimality is measured by a logarithmic potential function. Descriptions
of the algorithm can be found in Karmarkar’s original paper [175] and in Fletcher
[101, Section 8.7].

Karmarkar’s method falls outside the scope of this chapter, and in any case, its practical
performance does not appear to match the most efficient primal-dual methods. The
algorithms  we  discussed  in  this  chapter  have  polynomial  complexity,  like  Karmarkar’s 
method. 

Many  of  the  algorithmic  ideas  that  have  been  examined  since  1984  actually  had  their 
genesis  in  three  works  that  preceded  Karmarkar’s  paper.  The  first  of  these  is  the  book 
of  Fiacco  and  McCormick  [98]  on  logarithmic  barrier  functions  (originally  proposed  by 
Frisch  [115]),  which  proves  existence  of  the  central  path,  among  many  other  results.  Further 
analysis  of  the  central  path  was  carried  out  by  McLinden  [205],  in  the  context  of  nonlinear 
complementarity  problems.  Finally,  there  is  Dikin’s  paper  [94],  in  which  an  interior-point 
method  known  as  primal  affine-scaling  was  originally  proposed.  The  outburst  of  research  on 
primal-dual  methods,  which  culminated  in  the  efficient  software  packages  available  today, 
dates  to  the  seminal  paper  of  Megiddo  [206]. 

Todd gives an excellent survey of potential reduction methods in [284]. He relates
the  primal-dual  potential  reduction  method  mentioned  above  to  pure  primal  potential 
reduction  methods,  including  Karmarkar’s  original  algorithm,  and  discusses  extensions  to 
special  classes  of  nonlinear  problems. 

For  an  introduction  to  complexity  theory  and  its  relationship  to  optimization,  see  the 
book by Vavasis [297].

Andersen  et  al.  [6]  cover  many  of  the  practical  issues  relating  to  implementation  of 
interior-point  methods.  In  particular,  they  describe  an  alternative  scheme  for  choosing  the 
initial point, for the case in which upper bounds are also present on the variables.


EXERCISES

14.1 This exercise illustrates the fact that the bounds $(x, s) \ge 0$ are essential in
relating solutions of the system (14.4a) to solutions of the linear program (14.1) and its
dual. Consider the following linear program in $\mathbb{R}^2$:

$$ \min \; x_1, \quad \text{subject to} \quad x_1 + x_2 = 1, \quad (x_1, x_2) \ge 0. $$

Show that the primal-dual solution is

$$ x^* = (0, 1)^T, \qquad \lambda^* = 0, \qquad s^* = (1, 0)^T. $$

Also verify that the system $F(x, \lambda, s) = 0$ has the spurious solution

$$ x = (1, 0)^T, \qquad \lambda = 1, \qquad s = (0, -1)^T, $$

which has no relation to the solution of the linear program.

14.2

(i) Show that $\mathcal{N}_2(\theta_1) \subset \mathcal{N}_2(\theta_2)$ when $0 \le \theta_1 < \theta_2 < 1$ and that
$\mathcal{N}_{-\infty}(\gamma_1) \subset \mathcal{N}_{-\infty}(\gamma_2)$ for $0 < \gamma_2 \le \gamma_1 \le 1$.

(ii) Show that $\mathcal{N}_2(\theta) \subset \mathcal{N}_{-\infty}(\gamma)$ if $\gamma \le 1 - \theta$.

14.3 Given an arbitrary point $(x, \lambda, s) \in \mathcal{F}^o$, find the range of $\gamma$ values for which
$(x, \lambda, s) \in \mathcal{N}_{-\infty}(\gamma)$. (The range depends on $x$ and $s$.)

14.4 For $n = 2$, find a point $(x, s) > 0$ for which the condition

$$ \|XSe - \mu e\|_2 \le \theta\mu $$

is not satisfied for any $\theta \in [0, 1)$.

14.5 Prove that the neighborhoods $\mathcal{N}_{-\infty}(1)$ (see (14.18)) and $\mathcal{N}_2(0)$ (see (14.17))
coincide with the central path $\mathcal{C}$.

14.6 In the long-step path-following method (Algorithm 14.2), give a procedure for
calculating the maximum value of $\alpha$ such that (14.20) is satisfied.

14.7 Show that $\Phi_\rho$ defined by (14.46) has the property (14.45a).

14.8 Prove that the coefficient matrix in (14.16) is nonsingular if and only if $A$ has
full row rank.

14.9 Given $(\Delta x, \Delta\lambda, \Delta s)$ satisfying (14.10), prove (14.23).

14.10 Given an iterate $(x^k, \lambda^k, s^k)$ with $(x^k, s^k) > 0$, show that the quantities $\alpha^{\rm pri}_{\max}$
and $\alpha^{\rm dual}_{\max}$ defined by (14.36) are the largest values of $\alpha$ such that $x^k + \alpha\Delta x^k \ge 0$ and
$s^k + \alpha\Delta s^k \ge 0$, respectively.

14.11 Verify (14.37).

14.12 Given that $X$ and $S$ are diagonal with positive diagonal elements, show that

the  coefficient  matrix  in  (14.44a)  is  symmetric  and  positive  definite  if  and  only  if  A  has  full 
row  rank.  Does  this  result  continue  to  hold  if  we  replace  D  by  a  diagonal  matrix  in  which 
exactly  m  of  the  diagonal  elements  are  positive  and  the  remainder  are  zero?  (Here  m  is  the 
number  of  rows  of  A.) 

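14.13 Given a point $(x, \lambda, s)$ with $(x, s) > 0$, consider the trajectory $\mathcal{H}$ defined by

$$ F\big(\hat{x}(\tau), \hat{\lambda}(\tau), \hat{s}(\tau)\big) = \begin{bmatrix} (1 - \tau)(A^T\lambda + s - c) \\ (1 - \tau)(Ax - b) \\ (1 - \tau)XSe \end{bmatrix}, \qquad \big(\hat{x}(\tau), \hat{s}(\tau)\big) \ge 0, $$

for $\tau \in [0, 1]$, and note that $\big(\hat{x}(0), \hat{\lambda}(0), \hat{s}(0)\big) = (x, \lambda, s)$, while the limit of
$\big(\hat{x}(\tau), \hat{\lambda}(\tau), \hat{s}(\tau)\big)$ as $\tau \uparrow 1$ will lie in the primal-dual solution set of the linear
program. Find equations for the first, second, and third derivatives of $\mathcal{H}$ with respect to
$\tau$ at $\tau = 0$. Hence, write down a Taylor series approximation to $\mathcal{H}$ near the point
$(x, \lambda, s)$.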

14.14 Consider the following linear program, which contains “free variables” denoted
by $y$:

$$ \min \; c^T x + d^T y, \quad \text{subject to} \quad A_1 x + A_2 y = b, \quad x \ge 0. $$

By introducing Lagrange multipliers $\lambda$ for the equality constraints and $s$ for the bounds
$x \ge 0$, write down optimality conditions for this problem in an analogous fashion to (14.3).
Following (14.4) and (14.16), use these conditions to derive the general step equations for
a primal-dual interior-point method. Express these equations in augmented system form
analogously to (14.42) and explain why it is not possible to reduce further to a formulation
like (14.44) in which the coefficient matrix is symmetric positive definite.

14.15 Program Algorithm 14.3 in Matlab. Choose $\eta = 0.99$ uniformly in (14.38).
Test your code on a linear programming problem (14.1) generated by choosing $A$ randomly,
and then setting $x$, $s$, $b$, and $c$ as follows:

$$ x_i = \begin{cases} \text{random positive number}, & i = 1, 2, \dots, m, \\ 0, & i = m+1, m+2, \dots, n, \end{cases} $$

$$ s_i = \begin{cases} \text{random positive number}, & i = m+1, m+2, \dots, n, \\ 0, & i = 1, 2, \dots, m, \end{cases} $$

$$ \lambda = \text{random vector}, \qquad c = A^T\lambda + s, \qquad b = Ax. $$

Choose the starting point $(x^0, \lambda^0, s^0)$ with the components of $x^0$ and $s^0$ set to large positive
values.

14.17 Show that the solutions of the problems (14.39) are given explicitly by (14.40).
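
The test-problem construction described in Exercise 14.15 can be sketched as follows. The exercise asks for Matlab; this Python/NumPy version is only an illustration, and the function name `make_test_lp`, the sampling distributions, and the seed are assumptions. By construction, the returned $(x, \lambda, s)$ satisfies the KKT conditions of (14.1), so it is a known primal-dual solution against which a solver can be checked.

```python
import numpy as np

def make_test_lp(m, n, seed=0):
    """Random LP data (A, b, c) with a known primal-dual solution (x, lam, s)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x = np.zeros(n)
    s = np.zeros(n)
    x[:m] = rng.uniform(0.1, 1.0, size=m)      # positive for i = 1, ..., m
    s[m:] = rng.uniform(0.1, 1.0, size=n - m)  # positive for i = m+1, ..., n
    lam = rng.standard_normal(m)               # random multiplier vector
    c = A.T @ lam + s                          # dual feasibility holds by construction
    b = A @ x                                  # primal feasibility holds by construction
    return A, b, c, x, lam, s

A, b, c, x, lam, s = make_test_lp(m=3, n=6)
```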


Chapter 


Fundamentals 
of  Algorithms 
for  Nonlinear 
Constrained 
Optimization 

In  this  chapter,  we  begin  our  discussion  of  algorithms  for  solving  the  general  constrained 
optimization  problem 


$$ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad \begin{cases} c_i(x) = 0, & i \in \mathcal{E}, \\ c_i(x) \ge 0, & i \in \mathcal{I}, \end{cases} \tag{15.1} $$

where the objective function $f$ and the constraint functions $c_i$ are all smooth, real-valued
functions on a subset of $\mathbb{R}^n$, and $\mathcal{I}$ and $\mathcal{E}$ are finite index sets of inequality and equality
constraints, respectively. In Chapter 12, we used this general statement of the problem
to derive optimality conditions that characterize its solutions. This theory is useful for
motivating  the  various  algorithms  discussed  in  the  remainder  of  the  book,  which  differ  from 
each  other  in  fundamental  ways  but  are  all  iterative  in  nature.  They  generate  a  sequence  of 
estimates  of  the  solution  x*  that,  we  hope,  tend  toward  a  solution.  In  some  cases,  they  also 
generate  a  sequence  of  guesses  for  the  Lagrange  multipliers  associated  with  the  constraints. 
As  in  the  chapters  on  unconstrained  optimization,  we  study  only  algorithms  for  finding  local 
solutions of (15.1); the problem of finding a global solution is outside the scope of this book.

We  note  that  this  chapter  is  not  concerned  with  individual  algorithms  themselves, 
but  rather  with  fundamental  concepts  and  building  blocks  that  are  common  to  more  than 
one  algorithm.  After  reading  Sections  15.1  and  15.2,  the  reader  may  wish  to  glance  at  the 
material  in  Sections  15.3, 15.4,  15.5,  and  15.6,  and  return  to  these  sections  as  needed  during 
study  of  subsequent  chapters. 


15.1  CATEGORIZING  OPTIMIZATION  ALGORITHMS 


We  now  catalog  the  algorithmic  approaches  presented  in  the  rest  of  the  book.  No  standard 
taxonomy  exists  for  nonlinear  optimization  algorithms;  in  the  remaining  chapters  we  have 
grouped  the  various  approaches  as  follows. 

I.  In  Chapter  16  we  study  algorithms  for  solving  quadratic  programming  problems.  We 
consider  this  category  separately  because  of  its  intrinsic  importance,  because  its  particular 
characteristics  can  be  exploited  by  efficient  algorithms,  and  because  quadratic  programming 
subproblems  need  to  be  solved  by  sequential  quadratic  programming  methods  and  certain 
interior-point  methods  for  nonlinear  programming.  We  discuss  active  set,  interior-point, 
and  gradient  projection  methods. 

II.  In  Chapter  17  we  discuss  penalty  and  augmented  Lagrangian  methods.  By  combining  the 
objective function and constraints into a penalty function, we can attack problem (15.1) by
solving a sequence of unconstrained problems. For example, if only equality constraints are
present in (15.1), we can define the quadratic penalty function as

$$ f(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i^2(x), \tag{15.2} $$

where $\mu > 0$ is referred to as a penalty parameter. We minimize this unconstrained function,
for a series of increasing values of $\mu$, until the solution of the constrained optimization
problem is identified to sufficient accuracy.

If we use an exact penalty function, it may be possible to find a local solution of (15.1) by
solving a single unconstrained optimization problem. For the equality-constrained problem,
the function defined by

$$ f(x) + \mu \sum_{i \in \mathcal{E}} |c_i(x)|, $$

is usually an exact penalty function, for a sufficiently large value of $\mu > 0$. Although they
often are nondifferentiable, exact penalty functions can be minimized by solving a sequence
of smooth subproblems.

In  augmented  Lagrangian  methods,  we  define  a  function  that  combines  the  properties 
of  the  Lagrangian  function  (12.33)  and  the  quadratic  penalty  function  (15.2).  This  so-called 
augmented  Lagrangian  function  has  the  following  form  for  equality-constrained  problems: 

$$ \mathcal{L}_A(x, \lambda; \mu) = f(x) - \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i^2(x). $$

Methods based on this function fix $\lambda$ to some estimate of the optimal Lagrange multiplier
vector and fix $\mu$ to some positive value, then find a value of $x$ that approximately minimizes
$\mathcal{L}_A(\cdot, \lambda; \mu)$. At this new $x$-iterate, $\lambda$ and $\mu$ may be updated; then the process is repeated.
This approach avoids certain drawbacks associated with the minimization of the quadratic
penalty function (15.2).
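
To make the contrast between the quadratic penalty and augmented Lagrangian approaches concrete, the sketch below applies both to a small equality-constrained problem. The toy problem, the parameter values, and the multiplier update $\lambda \leftarrow \lambda - \mu\, c(x)$ are assumptions chosen for illustration (the text above says only that $\lambda$ and $\mu$ "may be updated"); because the problem is a convex quadratic with one linear constraint, each inner minimization reduces to a linear solve.

```python
import numpy as np

# Toy problem (an assumption for illustration, not from the text):
#   minimize f(x) = x1^2 + 2*x2^2   subject to   c(x) = x1 + x2 - 1 = 0,
# whose solution is x* = (2/3, 1/3) with multiplier lambda* = 4/3.
H = np.diag([2.0, 4.0])          # f(x) = 0.5 * x^T H x
a = np.array([1.0, 1.0])         # c(x) = a^T x - 1
x_star = np.array([2.0 / 3.0, 1.0 / 3.0])

def c(x):
    return a @ x - 1.0

def minimize_augmented(lam, mu):
    # L_A(x, lam; mu) = f(x) - lam*c(x) + (mu/2)*c(x)^2 is a convex quadratic in x.
    # Setting its gradient  H x - lam*a + mu*(a^T x - 1)*a  to zero gives a linear system.
    return np.linalg.solve(H + mu * np.outer(a, a), (lam + mu) * a)

# Quadratic penalty: the special case lam = 0, with mu increased.
penalty_errors = [np.linalg.norm(minimize_augmented(0.0, mu) - x_star)
                  for mu in (1.0, 1e1, 1e2, 1e3, 1e4)]

# Augmented Lagrangian: mu held fixed, lam updated after each minimization.
lam, mu = 0.0, 10.0
for _ in range(20):
    x = minimize_augmented(lam, mu)
    lam = lam - mu * c(x)        # common first-order multiplier update (assumed here)
al_error = np.linalg.norm(x - x_star)
```

In this example the penalty iterates approach $x^*$ only as $\mu$ grows large, whereas the augmented Lagrangian iterates converge with $\mu$ held at a moderate value, which illustrates one of the drawbacks the augmented Lagrangian is meant to avoid.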

III. In Chapter 18 we describe sequential quadratic programming (SQP) methods, which
model (15.1) by a quadratic programming subproblem at each iterate and define the search
direction to be the solution of this subproblem. In the basic SQP method, we define the
search direction $p_k$ at the iterate $(x_k, \lambda_k)$ to be the solution of

$$ \min_p \; \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}(x_k, \lambda_k)\, p + \nabla f(x_k)^T p \tag{15.3a} $$

$$ \text{subject to} \quad \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i \in \mathcal{E}, \tag{15.3b} $$

$$ \nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i \in \mathcal{I}, \tag{15.3c} $$

where $\mathcal{L}$ is the Lagrangian function defined in (12.33). The objective in this subproblem is
an approximation to the change in the Lagrangian function in moving from $x_k$ to $x_k + p$,
while the constraints are linearizations of the constraints in (15.1). A trust-region constraint
may be added to (15.3) to control the length and quality of the step, and quasi-Newton
approximate Hessians can be used in place of $\nabla^2_{xx}\mathcal{L}(x_k, \lambda_k)$. In a variant called sequential
linear-quadratic  programming,  the  step  pk  is  computed  in  two  stages.  First,  we  solve  a  linear 
program  that  is  defined  by  omitting  the  first  (quadratic)  term  from  the  objective  (15.3a) 
and  adding  a  trust-region  constraint  to  (15.3).  Next,  we  obtain  the  step  pk  by  solving  an 
equality-constrained  subproblem  in  which  the  constraints  active  at  the  solution  of  the  linear 
program  are  imposed  as  equalities,  while  all  other  constraints  are  ignored. 
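
For a problem with equality constraints only, the basic SQP subproblem (15.3) reduces to a linear KKT system in the step and the new multiplier estimate. The sketch below applies this to a hypothetical two-variable problem, assuming the sign convention $\mathcal{L}(x, \lambda) = f(x) - \lambda c(x)$ and taking full steps with no trust region or line search; the test problem and starting point are assumptions made for the example.

```python
import numpy as np

# Hypothetical test problem: minimize x1 + x2 subject to x1^2 + x2^2 - 2 = 0.
# Its solution is x* = (-1, -1) with multiplier lambda* = -1/2.
def f_grad(x):
    return np.array([1.0, 1.0])

def c(x):
    return x[0] ** 2 + x[1] ** 2 - 2.0

def c_grad(x):
    return 2.0 * x

def lagrangian_hessian(x, lam):
    # f is linear, so the Hessian of L = f - lam*c is -lam * (Hessian of c) = -2*lam*I.
    return -2.0 * lam * np.eye(2)

x = np.array([-1.5, -0.5])   # starting point near the solution (assumed)
lam = -1.0
for _ in range(10):
    W = lagrangian_hessian(x, lam)
    A = c_grad(x).reshape(1, 2)
    # KKT system of the equality-constrained QP:
    #   W p - A^T lam_new = -grad f,   A p = -c
    K = np.block([[W, -A.T], [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-f_grad(x), [-c(x)]])
    sol = np.linalg.solve(K, rhs)
    p, lam = sol[:2], sol[2]
    x = x + p                # full step; no globalization in this sketch
```

Started close to the solution, the iteration behaves like Newton's method on the KKT conditions. From a remote starting point, the safeguards mentioned in the text (trust regions, merit functions, filters) would be needed.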

IV. In Chapter 19 we study interior-point methods for nonlinear programming. These methods
can be viewed as extensions of the primal-dual interior-point methods for linear
programming discussed in Chapter 14. We can also view them as barrier methods that
generate steps by solving the problem

$$ \min_{x, s} \; f(x) - \mu \sum_{i=1}^{m} \log s_i \tag{15.4a} $$

$$ \text{subject to} \quad c_i(x) = 0, \quad i \in \mathcal{E}, \tag{15.4b} $$

$$ c_i(x) - s_i = 0, \quad i \in \mathcal{I}, \tag{15.4c} $$

for some positive value of the barrier parameter $\mu$, where the variables $s_i \ge 0$ are slacks.

Interior-point  methods  constitute  the  newest  class  of  methods  for  nonlinear  programming 
and  have  already  proved  to  be  formidable  competitors  of  sequential  quadratic  programming 
methods. 

The  algorithms  in  categories  I,  III,  and  IV  make  use  of  elimination  techniques,  in  which 
the  constraints  are  used  to  eliminate  some  of  the  degrees  of  freedom  in  the  problem.  As  a 
background  to  those  algorithms,  we  discuss  elimination  in  Section  15.3.  In  later  sections 
we  discuss  merit  functions  and  filters,  which  are  important  mechanisms  for  promoting 
convergence  of  nonlinear  programming  algorithms  from  remote  starting  points. 


15.2 THE COMBINATORIAL DIFFICULTY OF
INEQUALITY-CONSTRAINED  PROBLEMS 


One  of  the  main  challenges  in  solving  nonlinear  programming  problems  lies  in  dealing  with 
inequality  constraints — in  particular,  in  deciding  which  of  these  constraints  are  active  at 
the  solution  and  which  are  not.  One  approach,  which  is  the  essence  of  active-set  methods, 
starts by making a guess of the optimal active set $\mathcal{A}^*$, that is, the set of constraints that
are  satisfied  as  equalities  at  a  solution.  We  call  our  guess  the  working  set  and  denote  it  by 
W.  We  then  solve  a  problem  in  which  the  constraints  in  the  working  set  are  imposed  as 
equalities  and  the  constraints  not  in  W  are  ignored.  We  then  check  to  see  if  there  is  a  choice 
of  Lagrange  multipliers  such  that  the  solution  x*  obtained  for  this  W  satisfies  the  KKT 
conditions  (12.34).  If  so,  we  accept  x*  as  a  local  solution  of  (15.1).  Otherwise,  we  make  a 
different  choice  of  W  and  repeat  the  process.  This  approach  is  based  on  the  observation 
that,  in  general,  it  is  much  simpler  to  solve  equality-constrained  problems  than  to  solve 
nonlinear  programs. 

The number of choices for working set $W$ may be very large, up to $2^{|\mathcal{I}|}$, where $|\mathcal{I}|$
is the number of inequality constraints. We arrive at this estimate by observing that we can
make one of two choices for each $i \in \mathcal{I}$: to include it in $W$ or leave it out. Since the number of
possible working sets grows exponentially with the number of inequalities (a phenomenon
which we refer to as the combinatorial difficulty of nonlinear programming), we cannot
hope to design a practical algorithm by considering all possible choices for $W$.

The following example suggests that even for a small number of inequality constraints,
determination  of  the  optimal  active  set  is  not  a  simple  task. 


□ Example 15.1

Consider  the  problem 

$$ \min_{x, y} \; f(x, y) = \tfrac{1}{2}(x - 2)^2 + \tfrac{1}{2}\left(y - \tfrac{1}{2}\right)^2 \tag{15.5} $$

$$ \text{subject to} \quad (x + 1)^{-1} - y - \tfrac{1}{4} \ge 0, \qquad x \ge 0, \qquad y \ge 0. $$

We label the constraints, in order, with the indices 1 through 3. Figure 15.1 illustrates the
contours of the objective function (dashed circles). The feasible region is the region enclosed
by the curve and the two axes. We see that only the first constraint is active at the solution,
which is $(x^*, y^*)^T = (1.953, 0.089)^T$.

Let us now apply the working-set approach described above to (15.5), considering all
$2^3 = 8$ possible choices of $W$.

We consider first the possibility that no constraints are active at the solution, that is,
$W = \emptyset$. Since $\nabla f = (x - 2, y - 1/2)^T$, we see that the unconstrained minimum of $f$ lies
outside the feasible region. Hence, the optimal active set cannot be empty.

Figure 15.1 Graphical illustration of problem (15.5).

There are seven further possibilities. First, all three constraints could be active (that is,
$W = \{1, 2, 3\}$). A glance at Figure 15.1 shows that this does not happen for our problem; the
three constraints do not share a common point of intersection. Three further possibilities are
obtained by making a single constraint active (that is, $W = \{1\}$, $W = \{2\}$, and $W = \{3\}$),
while the final three possibilities are obtained by making exactly two constraints active (that
is, $W = \{1, 2\}$, $W = \{1, 3\}$, and $W = \{2, 3\}$). We consider three of these cases in detail.

- $W = \{2\}$; that is, only the constraint $x = 0$ is active. If we minimize $f$ enforcing only
this constraint, we obtain the point $(0, 1/2)^T$. A check of the KKT conditions (12.34)
shows that no matter how we choose the Lagrange multipliers, we cannot satisfy all
these conditions at $(0, 1/2)^T$. (We must have $\lambda_1 = \lambda_3 = 0$ to satisfy (12.34e), which
implies that we must set $\lambda_2 = -2$ to satisfy (12.34a); but this value of $\lambda_2$ violates the
condition (12.34d).)

- $W = \{1, 3\}$, which yields the single feasible point $(3, 0)^T$. Since constraint 2 is
inactive at this point, we have $\lambda_2 = 0$, so by solving (12.34a) for the other Lagrange
multipliers, we obtain $\lambda_1 = -16$ and $\lambda_3 = -16.5$. These values are negative, so they
violate (12.34d), and $x = (3, 0)^T$ cannot be a solution of (15.1).

- $W = \{1\}$. Solving the equality-constrained problem in which the first constraint is
active, we obtain $(x, y)^T = (1.953, 0.089)^T$ with Lagrange multiplier $\lambda_1 = 0.411$. It
is easy to see that by setting $\lambda_2 = \lambda_3 = 0$, the remaining KKT conditions (12.34) are
satisfied, so we conclude that this is a KKT point. Furthermore, it is easy to show that
the second-order sufficient conditions are satisfied, as the Hessian of the Lagrangian
is positive definite.
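
The multiplier values quoted in the three cases above can be checked numerically. The sketch below is one way to do so, assuming problem (15.5) reads $f(x, y) = \tfrac12 (x-2)^2 + \tfrac12 (y - \tfrac12)^2$ with constraints $c_1 = (x+1)^{-1} - y - \tfrac14$, $c_2 = x$, $c_3 = y$. The helper names are hypothetical. The multipliers are obtained from the stationarity condition $\nabla f = \sum_{i \in W} \lambda_i \nabla c_i$ by least squares, and the point for $W = \{1\}$ is found by bisection along the constraint curve.

```python
import numpy as np

def grad_f(x, y):
    return np.array([x - 2.0, y - 0.5])

def constraint_grads(x, y):
    # Gradients of c1 = 1/(x+1) - y - 1/4, c2 = x, c3 = y.
    return {1: np.array([-1.0 / (x + 1.0) ** 2, -1.0]),
            2: np.array([1.0, 0.0]),
            3: np.array([0.0, 1.0])}

def multipliers(point, working_set):
    """Least-squares solution of grad f = sum_{i in W} lam_i * grad c_i."""
    g = grad_f(*point)
    grads = constraint_grads(*point)
    G = np.column_stack([grads[i] for i in working_set])
    lam, *_ = np.linalg.lstsq(G, g, rcond=None)
    return dict(zip(working_set, lam))

lam_2 = multipliers((0.0, 0.5), [2])        # W = {2}
lam_13 = multipliers((3.0, 0.0), [1, 3])    # W = {1, 3}

def dphi(x):
    # Derivative of f(x, y(x)) along the curve y(x) = 1/(x+1) - 1/4.
    y = 1.0 / (x + 1.0) - 0.25
    return (x - 2.0) - (y - 0.5) / (x + 1.0) ** 2

lo, hi = 1.0, 3.0                            # dphi changes sign on this interval
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if dphi(mid) > 0:
        hi = mid
    else:
        lo = mid
x1 = 0.5 * (lo + hi)
y1 = 1.0 / (x1 + 1.0) - 0.25
lam_1 = multipliers((x1, y1), [1])           # W = {1}
```

Only the last working set produces a nonnegative multiplier, in agreement with the discussion above.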


Even  for  this  small  example,  we  see  that  it  is  exhausting  to  consider  all  possible  choices 
for  W.  Figure  15.1  suggests,  however,  that  some  choices  of  W  can  be  eliminated  from 
consideration  if  we  make  use  of  knowledge  of  the  functions  that  define  the  problem,  and 
their  derivatives.  In  fact,  the  active  set  methods  described  in  Chapter  16  use  this  kind  of 
information  to  make  a  series  of  educated  guesses  for  the  working  set,  avoiding  choices  of  W 
that  obviously  will  not  lead  to  a  solution  of  (15.1). 

A  different  approach  is  followed  by  interior-point  (or  barrier)  methods  discussed  in 
Chapter  19.  These  methods  generate  iterates  that  stay  away  from  the  boundary  of  the  feasible 
region  defined  by  the  inequality  constraints.  As  the  solution  of  the  nonlinear  program  is 
approached,  the  barrier  effects  are  weakened  to  permit  an  increasingly  accurate  estimate  of 
the  solution.  In  this  manner,  interior-point  methods  avoid  the  combinatorial  difficulty  of 
nonlinear  programming. 


15.3 ELIMINATION OF VARIABLES


When dealing with constrained optimization problems, it is natural to try to use the constraints
to eliminate some of the variables from the problem, to obtain a simpler problem
with fewer degrees of freedom. Elimination techniques must be used with care, however, as
they may alter the problem or introduce ill conditioning.

Figure 15.2 The danger of nonlinear elimination.


We  begin  with  an  example  in  which  it  is  safe  and  convenient  to  eliminate  variables.  In 
the  problem 


$$ \min \; f(x) = f(x_1, x_2, x_3, x_4) \quad \text{subject to} \quad x_1 + x_3^2 - x_4 x_3 = 0, \quad -x_2 + x_4 + x_3^2 = 0, $$

there is no risk in setting

$$ x_1 = x_4 x_3 - x_3^2, \qquad x_2 = x_4 + x_3^2, $$

to obtain a function of two variables

$$ h(x_3, x_4) = f(x_4 x_3 - x_3^2, \; x_4 + x_3^2, \; x_3, \; x_4), $$

which  we  can  minimize  using  the  unconstrained  optimization  techniques  described  in 
earlier  chapters. 

The  dangers  of  nonlinear  elimination  are  illustrated  in  the  following  example. 


□  Example  15.2  (Fletcher  [101]) 

Consider  the  problem 

$$ \min \; x^2 + y^2 \quad \text{subject to} \quad (x - 1)^3 = y^2. $$

The contours of the objective function and the constraints are illustrated in Figure 15.2,
which shows that the solution is $(x, y) = (1, 0)$.

We attempt to solve this problem by eliminating $y$. By doing so, we obtain

$$ h(x) = x^2 + (x - 1)^3. $$

Clearly, $h(x) \to -\infty$ as $x \to -\infty$. By blindly applying this transformation we may
conclude that the problem is unbounded, but this view ignores the fact that the constraint
$(x - 1)^3 = y^2$ implicitly imposes the bound $x \ge 1$ that is active at the solution.
Hence, if we wish to eliminate $y$, we should explicitly introduce the bound $x \ge 1$ into the
problem.
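
The effect described in this example is easy to reproduce numerically. The short sketch below evaluates the eliminated objective $h(x) = x^2 + (x-1)^3$ at a few increasingly negative points and then on a grid that respects the implicit bound $x \ge 1$; the sample points and grid are arbitrary choices made for illustration.

```python
import numpy as np

def h(x):
    # Objective after eliminating y via y^2 = (x - 1)^3.
    return x ** 2 + (x - 1.0) ** 3

# Ignoring the implicit bound: h decreases without bound as x -> -infinity.
xs_unrestricted = np.array([-1e1, -1e2, -1e3])
h_unrestricted = h(xs_unrestricted)

# Respecting x >= 1 (needed for y^2 = (x - 1)^3 to have a real solution y).
xs_feasible = np.linspace(1.0, 3.0, 201)
x_best = xs_feasible[np.argmin(h(xs_feasible))]
```

On the feasible grid the smallest value is attained at the bound $x = 1$, which matches the solution $(x, y) = (1, 0)$ of the original constrained problem.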


This  example  shows  that  the  use  of  nonlinear  equations  to  eliminate  variables  may 
result  in  errors  that  can  be  difficult  to  trace.  For  this  reason,  nonlinear  elimination  is  not 
used  by  most  optimization  algorithms.  Instead,  many  algorithms  linearize  the  constraints 
and  apply  elimination  techniques  to  the  simplified  problem.  We  now  describe  systematic 
procedures  for  performing  variable  elimination  using  linear  constraints. 


SIMPLE  ELIMINATION  USING  LINEAR  CONSTRAINTS 

We  consider  the  minimization  of  a  nonlinear  function  subject  to  a  set  of  linear  equality 
constraints, 


$$ \min \; f(x) \quad \text{subject to} \quad Ax = b, \tag{15.6} $$

where $A$ is an $m \times n$ matrix with $m \le n$. Suppose for simplicity that $A$ has full row rank.
(If such is not the case, we find either that the problem is inconsistent or that some of the
constraints are redundant and can be deleted without affecting the solution of the problem.)
Under this assumption, we can find a subset of $m$ columns of $A$ that is linearly independent.
If we gather these columns into an $m \times m$ matrix $B$ and define an $n \times n$ permutation matrix
$P$ that swaps these columns to the first $m$ column positions in $A$, we can write

$$ AP = [B \mid N], \tag{15.7} $$

where $N$ denotes the $n - m$ remaining columns of $A$. (The notation here is consistent
with that of Chapter 13, where we discussed similar concepts in the context of linear
programming.) We define the subvectors $x_B \in \mathbb{R}^m$ and $x_N \in \mathbb{R}^{n-m}$ as follows:

$$ \begin{bmatrix} x_B \\ x_N \end{bmatrix} = P^T x, \tag{15.8} $$

and call $x_B$ the basic variables and $B$ the basis matrix. Noting that $PP^T = I$, we can rewrite
the constraint $Ax = b$ as

$$ b = Ax = AP(P^T x) = Bx_B + Nx_N. $$

By rearranging this formula, we deduce that the basic variables can be expressed as follows:

$$ x_B = B^{-1}b - B^{-1}Nx_N. \tag{15.9} $$

We can therefore compute a feasible point for the constraints $Ax = b$ by choosing any value
of $x_N$ and then setting $x_B$ according to the formula (15.9). The problem (15.6) is therefore
equivalent to the unconstrained problem

$$ \min_{x_N} \; h(x_N) = f\left( P \begin{bmatrix} B^{-1}b - B^{-1}Nx_N \\ x_N \end{bmatrix} \right). \tag{15.10} $$

We refer to the substitution in (15.9) as simple elimination of variables.

This  discussion  shows  that  a  nonlinear  optimization  problem  with  linear  equality 
constraints  is,  from  a  mathematical  point  of  view,  the  same  as  an  unconstrained  problem. 
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Consider  the  problem 

min  sin(.ri  +  xf)  +  x\  +  -(X4  +  x\  +  .*6/2)  (15.11a) 

8xi  —  6x2  +  X3  +  9x4  +  4x5  =  6 

subject  to  (15.11b) 

3xi  +  2x2  —  X4  +  6x5  +  4x6  =  —4. 

By  defining  the  permutation  matrix  P  so  as  to  reorder  the  components  of  x  as  xT  — 
(x3,  x6,  Xi,  x2,  x4,  x5)r,  we  find  that  the  coefficient  matrix  AP  is 


1 

0 

^0 

1 

00 

9 

4 

0 

4 

3  2 

-1 

6 

The basis matrix $B$ is diagonal and therefore easy to invert. We obtain from (15.9) that

\[
\begin{bmatrix} x_3 \\ x_6 \end{bmatrix}
= - \begin{bmatrix} 8 & -6 & 9 & 4 \\ \tfrac34 & \tfrac12 & -\tfrac14 & \tfrac32 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_4 \\ x_5 \end{bmatrix}
+ \begin{bmatrix} 6 \\ -1 \end{bmatrix}. \tag{15.12}
\]

By substituting for $x_3$ and $x_6$ in (15.11a), the problem becomes

\[
\min_{x_1, x_2, x_4, x_5} \; \sin(x_1 + x_2) + (8x_1 - 6x_2 + 9x_4 + 4x_5 - 6)^2
+ \tfrac{1}{3}\left( x_4 + x_5^4 - \left[ \tfrac12 + \tfrac38 x_1 + \tfrac14 x_2 - \tfrac18 x_4 + \tfrac34 x_5 \right] \right). \tag{15.13}
\]

We could have chosen two other columns of the coefficient matrix $A$ (that is, two
variables other than $x_3$ and $x_6$) as the basis for elimination in the system (15.11b), but the
matrix $B^{-1}N$ would not have been so simple.
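
As a quick numerical check of the elimination (15.12), the following Python sketch computes
$x_3$ and $x_6$ from arbitrary values of the remaining variables and evaluates the residuals of
(15.11b). The function names `eliminate` and `constraint_residuals` and the sample values are
illustrative choices, not part of the original example.

```python
def eliminate(x1, x2, x4, x5):
    """Return the basic variables (x3, x6) given the nonbasic ones, as in (15.12)."""
    x3 = 6.0 - (8.0 * x1 - 6.0 * x2 + 9.0 * x4 + 4.0 * x5)
    x6 = -1.0 - (0.75 * x1 + 0.5 * x2 - 0.25 * x4 + 1.5 * x5)
    return x3, x6


def constraint_residuals(x1, x2, x3, x4, x5, x6):
    """Residuals of the two equality constraints (15.11b); both vanish at a feasible point."""
    r1 = 8.0 * x1 - 6.0 * x2 + x3 + 9.0 * x4 + 4.0 * x5 - 6.0
    r2 = 3.0 * x1 + 2.0 * x2 - x4 + 6.0 * x5 + 4.0 * x6 + 4.0
    return r1, r2


# Any choice of the nonbasic variables should yield a feasible point.
x1, x2, x4, x5 = 0.3, -1.2, 2.0, 0.5
x3, x6 = eliminate(x1, x2, x4, x5)
r1, r2 = constraint_residuals(x1, x2, x3, x4, x5, x6)
print(x3, x6, r1, r2)
```

Both residuals should be zero up to rounding error, whatever values are chosen for the nonbasic
variables.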


A  set  of  m  independent  columns  can  be  selected,  in  general,  by  means  of  Gaussian 
elimination.  In  the  parlance  of  linear  algebra,  we  can  compute  the  row  echelon  form  of  the 
matrix  and  choose  the  pivot  columns  as  the  columns  of  the  basis  B.  Ideally,  we  would  like 
B  to  be  easy  to  factor  and  well  conditioned.  A  technique  that  suits  these  purposes  is  a  sparse 
Gaussian  elimination  approach  that  attempts  to  preserve  sparsity  while  keeping  rounding 
errors  under  control.  A  well-known  implementation  of  this  algorithm  is  MA48  from  the 
HSL  library  [96].  As  we  discuss  below,  however,  there  is  no  guarantee  that  the  Gaussian 
elimination  process  will  identify  the  best  choice  of  basis  matrix. 

There is an interesting interpretation of the simple elimination-of-variables approach
that we have just described. To simplify the notation, we will assume from now on that
the coefficient matrix is already given to us so that the basic columns appear in the first $m$
positions, that is, $P = I$.

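From (15.8) and (15.9) we see that any feasible point $x$ for the linear constraints in
(15.6) can be written as

\[
\begin{bmatrix} x_B \\ x_N \end{bmatrix} = x = Y b + Z x_N, \tag{15.14}
\]

where

\[
Y = \begin{bmatrix} B^{-1} \\ 0 \end{bmatrix}, \qquad
Z = \begin{bmatrix} -B^{-1} N \\ I \end{bmatrix}. \tag{15.15}
\]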


Note that $Z$ has $n - m$ linearly independent columns (because of the presence of the identity
matrix in the lower block) and that it satisfies $AZ = 0$. Therefore, the columns of $Z$ form a
basis for the null space of $A$. In addition, the columns of $Y$ and the columns of $Z$ form a
linearly independent set. We note also from (15.15), (15.7) that $Yb$ is a particular solution of
the linear constraints $Ax = b$.

In other words, the simple elimination technique expresses feasible points as the sum
of a particular solution of $Ax = b$ (the first term in (15.14)) plus a displacement along the
null space of the constraints (the second term in (15.14)). The relations (15.14), (15.15)
indicate that the particular solution $Yb$ is obtained by holding $n - m$ components of $x$ at zero
while relaxing the other $m$ components (the ones in $x_B$) until they reach the constraints. The
particular solution $Yb$ is sometimes known as the coordinate relaxation step. In Figure 15.3,
we see the coordinate relaxation step $Yb$ obtained by choosing the basis matrix $B$ to be the
first column of $A$. If we were to choose $B$ to be the second column of $A$, the coordinate
relaxation step would lie along the $x_2$ axis.

Figure 15.3 Simple elimination, showing the coordinate relaxation step obtained by
choosing the basis to be the first column of $A$.

Simple elimination is inexpensive but can give rise to numerical instabilities. If the
feasible set in Figure 15.3 consisted of a line that was almost parallel to the $x_1$ axis, the
coordinate relaxation along this axis would be very large in magnitude. We would then be
computing $x$ as the difference of very large vectors, giving rise to numerical cancellation.
In that situation it would be preferable to choose a particular solution along the $x_2$ axis,
that is, to select a different basis. Selection of the best basis is, therefore, not a
straightforward task in general. To overcome the dangers of an excessively large coordinate
relaxation step, we could define the particular solution $Yb$ as the minimum-norm step to the
constraints. This approach is a special case of more general elimination strategies, which we now
describe.
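
The size difference between these particular solutions can be seen on a small example. The
Python sketch below uses a single constraint $\epsilon x_1 + x_2 = 1$ with $\epsilon = 10^{-8}$, a
line almost parallel to the $x_1$ axis. The data and the helper names are illustrative
assumptions, not taken from the text.

```python
import math


def coordinate_relaxation_step(a, b, basic_index):
    """Particular solution of a[0]*x1 + a[1]*x2 = b with the nonbasic component held at zero."""
    x = [0.0, 0.0]
    x[basic_index] = b / a[basic_index]
    return x


def minimum_norm_step(a, b):
    """Particular solution a^T (a a^T)^{-1} b, the point on the line closest to the origin."""
    aat = a[0] * a[0] + a[1] * a[1]
    return [a[0] * b / aat, a[1] * b / aat]


def norm(v):
    return math.hypot(v[0], v[1])


# A line that is almost parallel to the x1 axis: eps*x1 + x2 = 1.
a, b = [1e-8, 1.0], 1.0
step_x1 = coordinate_relaxation_step(a, b, 0)   # basis = first column of A
step_x2 = coordinate_relaxation_step(a, b, 1)   # basis = second column of A
step_mn = minimum_norm_step(a, b)
print(norm(step_x1), norm(step_x2), norm(step_mn))
```

All three vectors satisfy the constraint, but the step along the $x_1$ axis is larger than the
other two by roughly eight orders of magnitude, which is the situation the paragraph above warns
against.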


GENERAL  REDUCTION  STRATEGIES  FOR  LINEAR  CONSTRAINTS 

To generalize (15.14) and (15.15), we choose matrices $Y \in \mathbb{R}^{n \times m}$ and
$Z \in \mathbb{R}^{n \times (n-m)}$ with the following properties:

\[
[\, Y \mid Z \,] \in \mathbb{R}^{n \times n} \ \text{is nonsingular}, \qquad AZ = 0. \tag{15.16}
\]

These properties indicate that, as in (15.15), the columns of $Z$ are a basis for the null space
of $A$. Since $A$ has full row rank, so does $A[\, Y \mid Z \,] = [\, AY \mid 0 \,]$, so it follows
that the $m \times m$ matrix $AY$ is nonsingular. We now express any solution of the linear
constraints $Ax = b$ as

\[
x = Y x_Y + Z x_Z, \tag{15.17}
\]

for some vectors $x_Y \in \mathbb{R}^m$ and $x_Z \in \mathbb{R}^{n-m}$. By substituting (15.17) into
the constraints $Ax = b$, we obtain

\[
Ax = (AY) x_Y = b;
\]

hence by nonsingularity of $AY$, $x_Y$ can be written explicitly as

\[
x_Y = (AY)^{-1} b. \tag{15.18}
\]

By substituting this expression into (15.17), we conclude that any vector $x$ of the form

\[
x = Y (AY)^{-1} b + Z x_Z \tag{15.19}
\]

satisfies the constraints $Ax = b$ for any choice of $x_Z \in \mathbb{R}^{n-m}$. Therefore, the
problem (15.6) can be restated equivalently as the following unconstrained problem:

\[
\min_{x_Z} \; f\left( Y (AY)^{-1} b + Z x_Z \right). \tag{15.20}
\]

Ideally, we would like to choose $Y$ in such a way that the matrix $AY$ is as well
conditioned as possible, since it needs to be factorized to give the particular solution
$Y(AY)^{-1}b$. We can do this by computing $Y$ and $Z$ by means of a QR factorization of
$A^T$, which has the form

\[
A^T \Pi = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix}, \tag{15.21}
\]

where $[\, Q_1 \ Q_2 \,]$ is orthogonal. The submatrices $Q_1$ and $Q_2$ have orthonormal columns
and are of dimension $n \times m$ and $n \times (n - m)$, while $R$ is $m \times m$ upper triangular and
nonsingular and $\Pi$ is an $m \times m$ permutation matrix. (See the discussion following (A.24)
in the Appendix for further details.) We now define

\[
Y = Q_1, \qquad Z = Q_2, \tag{15.22}
\]

so that the columns of $Y$ and $Z$ form an orthonormal basis of $\mathbb{R}^n$. If we expand (15.21)
and do a little rearrangement, we obtain

\[
AY = \Pi R^T, \qquad AZ = 0.
\]

Therefore, $Y$ and $Z$ have the desired properties, and the condition number of $AY$ is the
same as that of $R$, which in turn is the same as that of $A$ itself. From (15.19) we see that any
solution of $Ax = b$ can be expressed as

\[
x = Q_1 R^{-T} \Pi^T b + Q_2 x_Z,
\]

for some vector $x_Z$. The computation $R^{-T} \Pi^T b$ can be carried out inexpensively, at the cost
of a single triangular substitution.

A simple computation shows that the particular solution $Q_1 R^{-T} \Pi^T b$ can also be
written as $A^T (A A^T)^{-1} b$. This vector is the solution of the following problem:

\[
\min \|x\|^2 \quad \text{subject to } Ax = b;
\]

that is, it is the minimum-norm solution of $Ax = b$. See Figure 15.5 for an illustration of
this step.
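
The parametrization (15.19) can be tried out on a tiny instance. The Python sketch below takes
one constraint in three variables, chooses $Y = A^T$ (so that the particular solution is
$A^T(AA^T)^{-1}b$) and two hand-picked null-space vectors as the columns of $Z$. The data and the
name `feasible_point` are assumptions made for illustration; a practical code would obtain $Y$
and $Z$ from a factorization such as (15.21).

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Illustrative data: one constraint (m = 1) in three variables (n = 3).
a = [1.0, 2.0, 2.0]        # the single row of A
b = 9.0
y = a[:]                   # Y = A^T, so Y (AY)^{-1} b = A^T (A A^T)^{-1} b
z_cols = [[2.0, -1.0, 0.0], [0.0, 1.0, -1.0]]   # columns of Z; each satisfies A z = 0


def feasible_point(x_z):
    """Return x = Y (AY)^{-1} b + Z x_Z as in (15.19)."""
    x_y = b / dot(a, y)    # (AY)^{-1} b is a scalar when m = 1
    x = [x_y * yi for yi in y]
    for coeff, z in zip(x_z, z_cols):
        x = [xi + coeff * zi for xi, zi in zip(x, z)]
    return x


x_min = feasible_point([0.0, 0.0])      # particular (minimum-norm) solution
x_other = feasible_point([1.5, -0.7])   # another feasible point
print(x_min, dot(a, x_min), dot(a, x_other))
```

Every choice of `x_z` gives a point satisfying the constraint, and because the columns of $Z$ are
orthogonal to $A^T$, no such point has a smaller norm than `x_min`.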

Elimination via the orthogonal basis (15.22) is ideal from the point of view of numerical
stability. The main cost associated with this reduction strategy is in computing the QR
factorization (15.21). Unfortunately, for problems in which $A$ is large and sparse, a sparse
QR factorization can be much more costly to compute than the sparse Gaussian elimination
strategy used in simple elimination. Therefore, other elimination strategies have been
developed that seek a compromise between these two techniques; see Exercise 15.7.


Figure 15.4 General elimination: Case in which $A \in \mathbb{R}^{1 \times 3}$, showing the particular
solution and a step in the null space of $A$.


EFFECT OF INEQUALITY CONSTRAINTS

Elimination of variables is not always beneficial if inequality constraints are present
alongside the equalities. For instance, if problem (15.11) had the additional constraint
$x \ge 0$, then after eliminating the variables $x_3$ and $x_6$, we would be left with the problem of
minimizing the function in (15.13) subject to the constraints

\[
\begin{aligned}
(x_1, x_2, x_4, x_5) &\ge 0, \\
8x_1 - 6x_2 + 9x_4 + 4x_5 &\le 6, \\
\tfrac34 x_1 + \tfrac12 x_2 - \tfrac14 x_4 + \tfrac32 x_5 &\le -1.
\end{aligned}
\]

Hence, the cost of eliminating the equality constraints (15.11b) is to make the inequalities
more complicated than the simple bounds $x \ge 0$. For many algorithms, this transformation
will not yield any benefit.

If, however, problem (15.11) included the general inequality constraint $3x_1 + 2x_3 \ge 1$,
the elimination (15.12) would transform the problem into one of minimizing the function
in (15.13) subject to the inequality constraint

\[
-13x_1 + 12x_2 - 18x_4 - 8x_5 \ge -11. \tag{15.23}
\]
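
The inequality (15.23) can be checked by substituting the expression for $x_3$ from the first
row of (15.12) into $3x_1 + 2x_3 \ge 1$. A short worked derivation, using only quantities defined
above:

```latex
\begin{aligned}
3x_1 + 2x_3 &= 3x_1 + 2\,(6 - 8x_1 + 6x_2 - 9x_4 - 4x_5) \\
            &= -13x_1 + 12x_2 - 18x_4 - 8x_5 + 12, \\
3x_1 + 2x_3 \ge 1 \quad &\Longleftrightarrow \quad -13x_1 + 12x_2 - 18x_4 - 8x_5 \ge -11 .
\end{aligned}
```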


In this case, the inequality constraint would not become much more complicated after
elimination of the equality constraints, so it is probably worthwhile to perform the
elimination.


15.4 MERIT FUNCTIONS AND FILTERS


Suppose  that  an  algorithm  for  solving  the  nonlinear  programming  problem  (15.1)  generates 
a  step  that  reduces  the  objective  function  but  increases  the  violation  of  the  constraints.  Should 
we  accept  this  step? 

This question is not easy to answer. We must look for a way to balance the twin (often
competing) goals of reducing the objective function and satisfying the constraints. Merit
functions and filters are two approaches for achieving this balance. In a typical constrained
optimization algorithm, a step $p$ will be accepted only if it leads to a sufficient reduction in
the merit function $\phi$ or if it is acceptable to the filter. These concepts are explained in the
rest of the section.


MERIT  FUNCTIONS 

In unconstrained optimization, the objective function $f$ is the natural choice for the
merit function. All the unconstrained optimization methods described in this book require
that $f$ be decreased at each step (or at least within a certain number of iterations). In feasible
methods for constrained optimization in which the starting point and all subsequent iterates
satisfy all the constraints in the problem, the objective function is still an appropriate merit
function. On the other hand, algorithms that allow iterates to violate the constraints require
some means to assess the quality of the steps and iterates. The merit function in this case
combines the objective with measures of constraint violation.

A popular choice of merit function for the nonlinear programming problem (15.1) is
the $\ell_1$ penalty function defined by

\[
\phi_1(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E}} |c_i(x)| + \mu \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{15.24}
\]

where we use the notation $[z]^- = \max\{0, -z\}$. The positive scalar $\mu$ is the penalty parameter,
which determines the weight that we assign to constraint satisfaction relative to minimization
of the objective. The $\ell_1$ merit function $\phi_1$ is not differentiable because of the presence of the
absolute value and $[\cdot]^-$ functions, but it has the important property of being exact.
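
A direct transcription of (15.24) into Python may make the roles of the two sums concrete. This
is a minimal sketch: the function name `l1_merit`, the calling convention (lists of callables),
and the toy problem are assumptions introduced here, and inequality constraints are taken in the
form $c_i(x) \ge 0$, consistent with the definition of $[z]^-$.

```python
def l1_merit(f, eq_constraints, ineq_constraints, x, mu):
    """Evaluate phi_1(x; mu) from (15.24).

    f                -- objective, a callable x -> float
    eq_constraints   -- callables c_i with the convention c_i(x) = 0
    ineq_constraints -- callables c_i with the convention c_i(x) >= 0
    """
    eq_violation = sum(abs(c(x)) for c in eq_constraints)
    ineq_violation = sum(max(0.0, -c(x)) for c in ineq_constraints)  # [z]^- = max{0, -z}
    return f(x) + mu * (eq_violation + ineq_violation)


# Toy problem (illustrative only): min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0,  x1 >= 0.
f = lambda x: x[0] + x[1]
eq = [lambda x: x[0] ** 2 + x[1] ** 2 - 2.0]
ineq = [lambda x: x[0]]

print(l1_merit(f, eq, ineq, [1.0, 1.0], mu=10.0))    # feasible point: merit equals f
print(l1_merit(f, eq, ineq, [-1.0, 2.0], mu=10.0))   # infeasible point: penalties are added
```

At a feasible point the merit function coincides with the objective; at an infeasible point each
unit of violation costs $\mu$.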

Definition  15.1  (Exact  Merit  Function). 

A merit function $\phi(x; \mu)$ is exact if there is a positive scalar $\mu^*$ such that for any
$\mu > \mu^*$, any local solution of the nonlinear programming problem (15.1) is a local minimizer
of $\phi(x; \mu)$.

We show in Theorem 17.3 that, under certain assumptions, the $\ell_1$ merit function
$\phi_1(x; \mu)$ is exact and that the threshold value $\mu^*$ is given by

\[
\mu^* = \max\{ |\lambda_i^*|, \ i \in \mathcal{E} \cup \mathcal{I} \},
\]

where the $\lambda_i^*$ denote the Lagrange multipliers associated with an optimal solution $x^*$. Since
the optimal Lagrange multipliers are, however, not known in advance, algorithms based on
the $\ell_1$ merit function contain rules for adjusting the penalty parameter whenever there is
reason to believe that it is not large enough (or is excessively large). These rules depend on
the choice of optimization algorithm and are discussed in the next chapters.

Another useful merit function is the exact $\ell_2$ function, which for equality-constrained
problems takes the form

\[
\phi_2(x; \mu) = f(x) + \mu \|c(x)\|_2. \tag{15.25}
\]

This function is nondifferentiable because the 2-norm term is not squared; its derivative is
not defined at $x$ for which $c(x) = 0$.

Some merit functions are both smooth and exact. To ensure that both properties hold,
we must include additional terms in the merit function. For equality-constrained problems,
Fletcher’s augmented Lagrangian is given by

\[
\phi_F(x; \mu) = f(x) - \lambda(x)^T c(x) + \tfrac12 \mu \sum_{i \in \mathcal{E}} c_i(x)^2, \tag{15.26}
\]

where $\mu > 0$ is the penalty parameter and

\[
\lambda(x) = [A(x) A(x)^T]^{-1} A(x) \nabla f(x). \tag{15.27}
\]

(Here $A(x)$ denotes the Jacobian of $c(x)$.) Although this merit function has some interesting
theoretical properties, it has practical limitations, including the expense of solving for
$\lambda(x)$ in (15.27).

A quite different merit function is the (standard) augmented Lagrangian in $x$ and $\lambda$,
which for equality-constrained problems has the form

\[
\mathcal{L}_A(x, \lambda; \mu) = f(x) - \lambda^T c(x) + \tfrac12 \mu \|c(x)\|_2^2. \tag{15.28}
\]

We assess the acceptability of a trial point $(x^+, \lambda^+)$ by comparing the value of
$\mathcal{L}_A(x^+, \lambda^+; \mu)$ with the value at the current iterate, $(x, \lambda)$. Strictly speaking,
$\mathcal{L}_A$ is not a merit function in the sense that a solution $(x^*, \lambda^*)$ of the nonlinear
programming problem is not in general a minimizer of $\mathcal{L}_A(x, \lambda; \mu)$ but only a stationary
point. Although some sequential quadratic programming methods use $\mathcal{L}_A$ successfully as a
merit function by adaptively modifying $\mu$ and $\lambda$, we will not consider its use as a merit
function further. Instead, we will focus primarily on the nonsmooth exact penalty functions
$\phi_1$ and $\phi_2$.

A trial step $x^+ = x + \alpha p$ generated by a line search algorithm will be accepted if it
produces a sufficient decrease in the merit function $\phi(x; \mu)$. One way to define this concept
is analogous to the condition (3.4) used in unconstrained optimization, where the amount
of decrease is not too small relative to the predicted change in the function over the step.
The $\ell_1$ and $\ell_2$ merit functions are not differentiable, but they have a directional derivative.
(See (A.51) for background on directional derivatives.) We write the directional derivative
of $\phi(x; \mu)$ in the direction $p$ as

\[
D(\phi(x; \mu); p).
\]

In a line search method, the sufficient decrease condition requires the steplength parameter
$\alpha > 0$ to be small enough that the inequality

\[
\phi(x + \alpha p; \mu) \le \phi(x; \mu) + \eta \alpha D(\phi(x; \mu); p) \tag{15.29}
\]

is satisfied for some $\eta \in (0, 1)$.

Trust-region methods typically use a quadratic model $q(p)$ to estimate the value of
the merit function $\phi$ after a step $p$; see Section 18.5. The sufficient decrease condition can
be stated in terms of a decrease in this model, as follows:

\[
\phi(x + p; \mu) \le \phi(x; \mu) - \eta (q(0) - q(p)), \tag{15.30}
\]

for some $\eta \in (0, 1)$. (The final term in (15.30) is positive, because the step $p$ is computed
to decrease the model $q$.)


FILTERS 

Filter  techniques  are  step  acceptance  mechanisms  based  on  ideas  from  multiobjective 
optimization.  Our  derivation  starts  with  the  observation  that  nonlinear  programming  has 
two  goals:  minimization  of  the  objective  function  and  the  satisfaction  of  the  constraints.  If 
we  define  a  measure  of  infeasibility  as 

\[
h(x) = \sum_{i \in \mathcal{E}} |c_i(x)| + \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{15.31}
\]

we can write these two goals as

\[
\min_x f(x) \quad \text{and} \quad \min_x h(x). \tag{15.32}
\]

Unlike merit functions, which combine both problems into a single minimization
problem, filter methods keep the two goals in (15.32) separate. Filter methods accept a trial
step $x^+$ as a new iterate if the pair $(f(x^+), h(x^+))$ is not dominated by a previous
pair $(f_l, h_l) = (f(x_l), h(x_l))$ generated by the algorithm. These concepts are defined as
follows.

Definition 15.2.

(a) A pair $(f_k, h_k)$ is said to dominate another pair $(f_l, h_l)$ if both $f_k \le f_l$ and $h_k \le h_l$.

(b) A filter is a list of pairs $(f_l, h_l)$ such that no pair dominates any other.

(c) An iterate $x_k$ is said to be acceptable to the filter if $(f_k, h_k)$ is not dominated by any pair
in the filter.

When an iterate $x_k$ is acceptable to the filter, we (normally) add $(f_k, h_k)$ to the filter
and remove any pairs that are dominated by $(f_k, h_k)$. Figure 15.6 shows a filter where each
pair $(f_l, h_l)$ in the filter is represented as a black dot. Every point in the filter creates an
(infinite) rectangular region, and their union defines the set of pairs not acceptable to the
filter. More specifically, a trial point $x^+$ is acceptable to the filter if $(f^+, h^+)$ lies below or to
the left of the solid line in Figure 15.6.
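
The definitions above translate almost line by line into code. The following Python sketch
represents a filter as a plain list of $(f, h)$ tuples; the function names and the sample values
are hypothetical and chosen only for illustration.

```python
def dominates(pair_a, pair_b):
    """True if pair_a = (f_a, h_a) dominates pair_b in the sense of Definition 15.2(a)."""
    return pair_a[0] <= pair_b[0] and pair_a[1] <= pair_b[1]


def acceptable(filter_pairs, trial):
    """True if no pair in the filter dominates the trial pair (Definition 15.2(c))."""
    return not any(dominates(entry, trial) for entry in filter_pairs)


def add_to_filter(filter_pairs, new_pair):
    """Return a new filter containing new_pair, with the pairs it dominates removed."""
    kept = [entry for entry in filter_pairs if not dominates(new_pair, entry)]
    kept.append(new_pair)
    return kept


# Illustrative (f, h) values.
flt = [(5.0, 0.1), (3.0, 0.5), (1.0, 2.0)]
rejected = acceptable(flt, (4.0, 0.6))   # dominated by (3.0, 0.5)
accepted = acceptable(flt, (2.0, 0.4))   # not dominated by any entry
new_flt = add_to_filter(flt, (2.0, 0.4))
print(rejected, accepted, new_flt)
```

Adding the accepted pair removes $(3.0, 0.5)$, which it dominates, so the list continues to
satisfy part (b) of the definition.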

To compare the filter and merit function approaches, we plot in Figure 15.7 the
contour line of the set of pairs $(f, h)$ such that $f + \mu h = f_k + \mu h_k$, where $x_k$ is the current
iterate. The region to the left of this line corresponds to the set of pairs that reduce the merit
function $\phi(x; \mu) = f(x) + \mu h(x)$; clearly this set is quite different from the set of points
acceptable to the filter.

If a trial step $x^+ = x_k + \alpha_k p_k$ generated by a line search method gives a pair $(f^+, h^+)$
that is acceptable to the filter, we set $x_{k+1} = x^+$; otherwise, a backtracking line search is
performed. In a trust-region method, if the step is not acceptable to the filter, the trust
region is reduced, and a new step is computed.

Figure 15.7 Comparing the filter and merit function techniques.

Several enhancements to this filter technique are needed to obtain global convergence
and good practical performance. We need to ensure, first of all, that we do not accept a point
whose $(f, h)$ pair is very close to the current pair $(f_k, h_k)$ or to another pair in the filter. We
do so by modifying the acceptability criterion and imposing a sufficient decrease condition.
A trial iterate $x^+$ is acceptable to the filter if, for all pairs $(f_j, h_j)$ in the filter, we have that

\[
f(x^+) \le f_j - \beta h_j \quad \text{or} \quad h(x^+) \le h_j - \beta h_j, \tag{15.33}
\]

for $\beta \in (0, 1)$. Although this condition is effective in practice using, say, $\beta = 10^{-5}$, for
purposes of analysis it may be advantageous to replace the first inequality by

\[
f(x^+) \le f_j - \beta h^+.
\]
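
A minimal Python sketch of the acceptance test (15.33) follows. The function name
`acceptable_with_margin`, the default value of `beta`, and the sample filter are assumptions made
for illustration only.

```python
def acceptable_with_margin(filter_pairs, f_trial, h_trial, beta=1e-5):
    """Acceptance test (15.33): sufficient decrease in f or in h relative to every filter pair."""
    return all(
        f_trial <= f_j - beta * h_j or h_trial <= h_j - beta * h_j
        for (f_j, h_j) in filter_pairs
    )


flt = [(3.0, 0.5), (1.0, 2.0)]
clear_gain = acceptable_with_margin(flt, 2.0, 0.4)          # improves on each pair by a margin
too_close = acceptable_with_margin(flt, 3.0, 0.5 - 1e-9)    # almost identical to (3.0, 0.5)
print(clear_gain, too_close)
```

The second trial pair would pass the plain dominance test of Definition 15.2, but it is rejected
here because its improvement over $(3.0, 0.5)$ is smaller than the margin $\beta h_j$.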

A second enhancement addresses some problematic aspects of the filter mechanism.
Under certain circumstances, the search directions generated by line search methods may
require arbitrarily small steplengths $\alpha_k$ to be acceptable to the filter. This phenomenon can
cause the algorithm to stall and fail. To guard against this situation, if the backtracking
line search generates a steplength that is smaller than a given threshold $\alpha_{\min}$, the algorithm
switches to a feasibility restoration phase, which we describe below. Similarly, in a trust-region
method, if a sequence of trial steps is rejected by the filter, the trust-region radius may be
decreased so much that the trust-region subproblem becomes infeasible (see Section 18.5).
In this case, too, the feasibility restoration phase is invoked. (Other mechanisms could be
employed to handle this situation, but as we discuss below, the feasibility restoration phase
can help the algorithm achieve other useful goals.)

The feasibility restoration phase aims exclusively to reduce the constraint violation,
that is, to find an approximate solution to the problem

\[
\min_x \; h(x).
\]

Although $h(x)$ defined by (15.31) is not smooth, we show in Chapter 17 how to minimize
it using a smooth constrained optimization subproblem. This phase terminates at an iterate
that has a sufficiently small value of $h$ and is compatible with the filter.

We  now  present  a  framework  for  filter  methods  that  assumes  that  iterates  are  generated 
by  a  trust-region  method;  see  Section  18.5  for  a  discussion  of  trust-region  methods  for 
constrained  optimization. 

Algorithm 15.1 (General Filter Method).

 Choose a starting point $x_0$ and an initial trust-region radius $\Delta_0$;
 Set $k \leftarrow 0$;
 repeat until a convergence test is satisfied
     if the step-generation subproblem is infeasible
         Compute $x_{k+1}$ using the feasibility restoration phase;
     else
         Compute a trial iterate $x^+ = x_k + p_k$;
         if $(f^+, h^+)$ is acceptable to the filter
             Set $x_{k+1} = x^+$ and add $(f_{k+1}, h_{k+1})$ to the filter;
             Choose $\Delta_{k+1}$ such that $\Delta_{k+1} \ge \Delta_k$;
             Remove all pairs from the filter that are dominated by $(f_{k+1}, h_{k+1})$;
         else
             Reject the step, set $x_{k+1} = x_k$;
             Choose $\Delta_{k+1} < \Delta_k$;
         end if
     end if
     $k \leftarrow k + 1$;
 end repeat


Other  enhancements  of  this  simple  filter  framework  are  used  in  practice;  they  depend 
on  the  choice  of  algorithm  and  will  be  discussed  in  subsequent  chapters. 


15.5  THE  MARATOS  EFFECT 


Some algorithms based on merit functions or filters may fail to converge rapidly because they
reject steps that make good progress toward a solution. This undesirable phenomenon is
often called the Maratos effect, because it was first observed by Maratos [199]. It is illustrated
by the following example, in which steps $p_k$, which would yield quadratic convergence if
accepted, cause an increase both in the objective function value and the constraint violation.

Figure 15.8 Maratos Effect: Example 15.4. Note that the constraint is no longer satisfied after
the step from $x_k$ to $x_k + p_k$, and the objective value has increased.


□  Example  15.4  (Powell  [255]) 

Consider the problem

\[
\min \; f(x_1, x_2) = 2(x_1^2 + x_2^2 - 1) - x_1, \quad \text{subject to} \quad x_1^2 + x_2^2 - 1 = 0. \tag{15.34}
\]

One can verify (see Figure 15.8) that the optimal solution is $x^* = (1, 0)^T$, that the
corresponding Lagrange multiplier is $\lambda^* = \tfrac32$, and that $\nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) = I$.

Let us consider an iterate $x_k$ of the form $x_k = (\cos\theta, \sin\theta)^T$, which is feasible for any
value of $\theta$. Suppose that our algorithm computes the following step:

\[
p_k = \begin{bmatrix} \sin^2\theta \\ -\sin\theta \cos\theta \end{bmatrix}, \tag{15.35}
\]

which yields a trial point

\[
x_k + p_k = \begin{bmatrix} \cos\theta + \sin^2\theta \\ \sin\theta (1 - \cos\theta) \end{bmatrix}.
\]

By using elementary trigonometric identities, we have that

\[
\|x_k + p_k - x^*\|_2 = 2 \sin^2(\theta/2), \qquad \|x_k - x^*\|_2 = 2 |\sin(\theta/2)|,
\]


and therefore

\[
\frac{\|x_k + p_k - x^*\|_2}{\|x_k - x^*\|_2^2} = \frac12.
\]

Hence, this step approaches the solution at a rate consistent with Q-quadratic convergence.
However, we have that

\[
\begin{aligned}
f(x_k + p_k) &= \sin^2\theta - \cos\theta > -\cos\theta = f(x_k), \\
c(x_k + p_k) &= \sin^2\theta > c(x_k) = 0,
\end{aligned}
\]

so that, as can be seen in Figure 15.8, both the objective function value and the constraint
violation increase over this step. This behavior occurs for any nonzero value of $\theta$, even if the
initial point is arbitrarily close to the solution.
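
The claims of this example are easy to confirm numerically. The Python sketch below evaluates
the distance ratio and the changes in $f$ and $|c|$ for a few sample angles; the helper name
`maratos_step` and the chosen values of $\theta$ are illustrative assumptions.

```python
import math


def f(x):
    return 2.0 * (x[0] ** 2 + x[1] ** 2 - 1.0) - x[0]


def c(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def maratos_step(theta):
    """Return the iterate x_k = (cos theta, sin theta) and the step p_k of (15.35)."""
    s, co = math.sin(theta), math.cos(theta)
    return [co, s], [s * s, -s * co]


x_star = [1.0, 0.0]
for theta in (0.5, 0.1, 0.01):
    xk, pk = maratos_step(theta)
    trial = [xk[0] + pk[0], xk[1] + pk[1]]
    # Ratio appearing in the Q-quadratic convergence estimate.
    ratio = math.dist(trial, x_star) / math.dist(xk, x_star) ** 2
    print(theta, ratio, f(trial) - f(xk), abs(c(trial)) - abs(c(xk)))
```

For each angle the ratio stays near $1/2$, while the last two printed quantities are positive:
the step brings the iterate much closer to $x^*$ and yet increases both the objective and the
constraint violation.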


On the example above, any algorithm that requires reduction of a merit function of
the form

\[
\phi(x; \mu) = f(x) + \mu h(c(x)),
\]

where $h(\cdot)$ is a nonnegative function satisfying $h(0) = 0$, will reject the good step (15.35).
(Examples of such merit functions include the $\phi_1$ and $\phi_2$ penalty functions.) The step
(15.35) will also be rejected by the filter mechanism described above because the pair
$(f(x_k + p_k), h(x_k + p_k))$ is dominated by $(f_k, h_k)$. Therefore, all these approaches will
suffer from the Maratos effect.

If no remedial measures are taken, the Maratos effect can slow optimization methods
by interfering with good steps away from the solution and by preventing superlinear
convergence. Strategies for avoiding the Maratos effect include the following.

1. We can use a merit function that does not suffer from the Maratos effect. An example is
Fletcher’s augmented Lagrangian function (15.26).

2. We can use a second-order correction in which we add to $p_k$ a step $\hat{p}_k$, which is computed
at $c(x_k + p_k)$ and which decreases the constraint violation.

3. We can allow the merit function $\phi$ to increase on certain iterations; that is, we can use a
nonmonotone strategy.

We discuss the last two approaches in the next section.


15.6 SECOND-ORDER CORRECTION AND NONMONOTONE TECHNIQUES


By adding a correction term that decreases the constraint violation, various algorithms
are able to overcome the difficulties associated with the Maratos effect. We describe this
technique with respect to the equality-constrained problem, in which the constraint is
$c(x) = 0$, where $c : \mathbb{R}^n \to \mathbb{R}^{|\mathcal{E}|}$.

Given a step $p_k$, the second-order correction step $\hat{p}_k$ is defined to be

\[
\hat{p}_k = -A_k^T (A_k A_k^T)^{-1} c(x_k + p_k), \tag{15.36}
\]

where $A_k = A(x_k)$ is the Jacobian of $c$ at $x_k$. Note that $\hat{p}_k$ has the property that it satisfies a
linearization of the constraint $c$ at the point $x_k + p_k$, that is,

\[
A_k \hat{p}_k + c(x_k + p_k) = 0.
\]

In fact, $\hat{p}_k$ is the minimum-norm solution of this equation. (A different interpretation of
the second-order correction is given in Section 18.3.)

The effect of the correction step $\hat{p}_k$ is to decrease the quantity $\|c(x)\|$ to the order
of $\|x_k - x^*\|^3$, provided the primary step $p_k$ satisfies $A_k p_k + c(x_k) = 0$. This estimate
indicates that the step from $x_k$ to $x_k + p_k + \hat{p}_k$ will decrease the merit function, at
least near the solution. The cost of this enhancement includes the additional evaluation of
the constraint function $c$ at $x_k + p_k$ and the linear algebra required to calculate the step $\hat{p}_k$
from (15.36).
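
To see the correction at work, the Python sketch below applies (15.36) to the constraint of
Example 15.4. With a single constraint, $A_k A_k^T$ is a scalar, so no linear solve is needed.
The helper names and the choice $\theta = 0.1$ are assumptions made for illustration.

```python
import math


def c(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def jacobian_row(x):
    """Gradient of c, i.e. the single row of A(x), for the constraint in (15.34)."""
    return [2.0 * x[0], 2.0 * x[1]]


def second_order_correction(xk, pk):
    """Formula (15.36) for one constraint: A_k A_k^T is the scalar ||a||^2."""
    a = jacobian_row(xk)
    trial = [xk[0] + pk[0], xk[1] + pk[1]]
    scale = -c(trial) / (a[0] ** 2 + a[1] ** 2)
    return [scale * a[0], scale * a[1]]


theta = 0.1
xk = [math.cos(theta), math.sin(theta)]
pk = [math.sin(theta) ** 2, -math.sin(theta) * math.cos(theta)]   # step (15.35)
pk_hat = second_order_correction(xk, pk)

trial = [xk[0] + pk[0], xk[1] + pk[1]]
corrected = [trial[0] + pk_hat[0], trial[1] + pk_hat[1]]
print(abs(c(trial)), abs(c(corrected)))
```

For this example the constraint violation drops from $\sin^2\theta$ at the trial point to
$\tfrac14 \sin^4\theta$ at the corrected point.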

We now describe an algorithm that uses a merit function together with a line-search
strategy and a second-order correction step. We assume that the search direction $p_k$ and the
penalty parameter $\mu_k$ are computed so that $p_k$ is a descent direction for the merit function,
that is, $D(\phi(x_k; \mu); p_k) < 0$. In Chapters 18 and 19, we discuss how to accomplish these
goals. The key feature of the algorithm is that, if the full step $\alpha_k = 1$ does not produce
satisfactory descent in the merit function, we try the second-order correction step before
backtracking along the original direction $p_k$.

Algorithm 15.2 (Generic Algorithm with Second-Order Correction).

 Choose parameters $\eta \in (0, 0.5)$ and $\tau_1, \tau_2$ with $0 < \tau_1 < \tau_2 < 1$;
 Choose initial point $x_0$; set $k \leftarrow 0$;
 repeat until a convergence test is satisfied:
     Compute a search direction $p_k$;
     Set $\alpha_k \leftarrow 1$, newpoint $\leftarrow$ false;
     while newpoint = false
         if $\phi(x_k + \alpha_k p_k; \mu) \le \phi(x_k; \mu) + \eta \alpha_k D(\phi(x_k; \mu); p_k)$
             Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$;
             Set newpoint $\leftarrow$ true;
         else if $\alpha_k = 1$
             Compute $\hat{p}_k$ from (15.36);
             if $\phi(x_k + p_k + \hat{p}_k; \mu) \le \phi(x_k; \mu) + \eta D(\phi(x_k; \mu); p_k)$
                 Set $x_{k+1} \leftarrow x_k + p_k + \hat{p}_k$;
                 Set newpoint $\leftarrow$ true;
             else
                 Choose new $\alpha_k$ in $[\tau_1 \alpha_k, \tau_2 \alpha_k]$;
             end
         else
             Choose new $\alpha_k$ in $[\tau_1 \alpha_k, \tau_2 \alpha_k]$;
         end
     end while
 end repeat

In this algorithm, the full second-order correction step $\hat{p}_k$ is discarded if it does not
produce a reduction in the merit function. We do not backtrack along the direction
$p_k + \hat{p}_k$ because it is not guaranteed to be a descent direction for the merit function.
A variation of this algorithm applies the second-order correction step only if the
sufficient decrease condition (15.29) is violated as a result of an increase in the norm of the
constraints.

The second-order correction strategy is effective in practice. The cost of performing
the extra constraint function evaluation and an additional backsolve in (15.36) is outweighed
by added robustness and efficiency.
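
The inner loop of Algorithm 15.2 can be sketched in Python as follows. This is an illustration
under stated assumptions rather than a reference implementation: the function name
`soc_line_search`, the single contraction factor `tau` (a fixed choice inside
$[\tau_1, \tau_2]$), the value $\mu = 2$, and the use of the $\ell_1$ merit function on Example
15.4 are all choices made here. The directional derivative passed in is specific to this
example: at a feasible $x_k$ with $A_k p_k = 0$ the penalty term contributes nothing, so
$D(\phi_1(x_k); p_k) = \nabla f(x_k)^T p_k = -\sin^2\theta$.

```python
import math


def soc_line_search(phi, xk, pk, dphi, correction, eta=0.1, tau=0.5):
    """Inner loop of Algorithm 15.2 for one iterate.

    phi        -- merit function, callable on a point (list of floats)
    dphi       -- directional derivative D(phi(x_k); p_k), assumed negative
    correction -- callable (xk, pk) -> second-order correction step, cf. (15.36)
    Returns the new iterate and a label describing which step was accepted.
    """
    alpha = 1.0
    while True:
        trial = [x + alpha * p for x, p in zip(xk, pk)]
        if phi(trial) <= phi(xk) + eta * alpha * dphi:
            return trial, "line search, alpha = %g" % alpha
        if alpha == 1.0:
            # The full step failed: try the corrected step once before backtracking.
            p_hat = correction(xk, pk)
            corrected = [x + p + q for x, p, q in zip(xk, pk, p_hat)]
            if phi(corrected) <= phi(xk) + eta * dphi:
                return corrected, "second-order correction"
        alpha *= tau  # backtrack along the original direction p_k


# Data of Example 15.4 with the l1 merit function phi_1 = f + mu*|c|.
mu = 2.0
c = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
phi = lambda x: 2.0 * c(x) - x[0] + mu * abs(c(x))


def correction(xk, pk):
    a = [2.0 * xk[0], 2.0 * xk[1]]   # gradient of c at x_k
    scale = -c([xk[0] + pk[0], xk[1] + pk[1]]) / (a[0] ** 2 + a[1] ** 2)
    return [scale * a[0], scale * a[1]]


theta = 0.1
s, co = math.sin(theta), math.cos(theta)
xk, pk = [co, s], [s * s, -s * co]
dphi = -s * s   # gradient of f dotted with p_k; the |c| term contributes nothing here
x_next, how = soc_line_search(phi, xk, pk, dphi, correction)
print(how, x_next)
```

On this example the full step is rejected by the sufficient decrease test, while the corrected
step is accepted, so no backtracking takes place.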

NONMONOTONE  (WATCHDOG)  STRATEGY 

The inefficiencies caused by the Maratos effect can also be avoided by occasionally
accepting steps that increase the merit function; such steps are called relaxed steps. There
is a limit to our tolerance, however. If a sufficient reduction of the merit function has not
been obtained within a certain number of iterates of the relaxed step ($\hat{t}$ iterates, say), then
we return to the iterate before the relaxed step and perform a normal iteration, using a line
search or some other technique to force a reduction in the merit function.

In contrast with the second-order correction, which aims only to improve satisfaction
of the constraints, this nonmonotone strategy always takes regular steps $p_k$ of the algorithm
that aim both for improved feasibility and optimality. The hope is that any increase in the
merit function over a single step will be temporary, and that subsequent steps will more
than compensate for it.

We now describe a particular instance of the nonmonotone approach called the
watchdog strategy. We set $\hat{t} = 1$, so that we allow the merit function to increase on just
a single step before insisting on a sufficient decrease in the merit function. As above, we
focus our discussion on a line search algorithm that uses a nonsmooth merit function $\phi$.
We assume that the penalty parameter $\mu$ is not changed until a successful cycle has been
completed. To simplify the notation, we omit the dependence of $\phi$ on $\mu$ and write the merit
function as $\phi(x)$ and the directional derivative as $D(\phi(x); p_k)$.

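Algorithm 15.3 (Watchdog).

  Choose a constant $\eta \in (0, 0.5)$ and an initial point $x_0$;
  Set $k \leftarrow 0$, $\mathcal{S} \leftarrow \{0\}$;
  repeat until a termination test is satisfied
    Compute a step $p_k$;
    Set $x_{k+1} \leftarrow x_k + p_k$;
    if $\phi(x_{k+1}) \le \phi(x_k) + \eta D(\phi(x_k); p_k)$
      $k \leftarrow k + 1$, $\mathcal{S} \leftarrow \mathcal{S} \cup \{k\}$;
    else
      Compute a search direction $p_{k+1}$ from $x_{k+1}$;
      Find $\alpha_{k+1}$ such that
        $\phi(x_{k+2}) \le \phi(x_{k+1}) + \eta \alpha_{k+1} D(\phi(x_{k+1}); p_{k+1})$;
      Set $x_{k+2} \leftarrow x_{k+1} + \alpha_{k+1} p_{k+1}$;
      if $\phi(x_{k+1}) \le \phi(x_k)$ or $\phi(x_{k+2}) \le \phi(x_k) + \eta D(\phi(x_k); p_k)$
        $k \leftarrow k + 2$, $\mathcal{S} \leftarrow \mathcal{S} \cup \{k\}$;
      else if $\phi(x_{k+2}) > \phi(x_k)$
        (* return to $x_k$ and search along $p_k$ *)
        Find $\alpha_k$ such that $\phi(x_{k+3}) \le \phi(x_k) + \eta \alpha_k D(\phi(x_k); p_k)$;
        Compute $x_{k+3} = x_k + \alpha_k p_k$;
        $k \leftarrow k + 3$, $\mathcal{S} \leftarrow \mathcal{S} \cup \{k\}$;
      else
        Compute a direction $p_{k+2}$ from $x_{k+2}$;
        Find $\alpha_{k+2}$ such that
          $\phi(x_{k+3}) \le \phi(x_{k+2}) + \eta \alpha_{k+2} D(\phi(x_{k+2}); p_{k+2})$;
        Set $x_{k+3} \leftarrow x_{k+2} + \alpha_{k+2} p_{k+2}$;
        $k \leftarrow k + 3$, $\mathcal{S} \leftarrow \mathcal{S} \cup \{k\}$;
      end
    end
  end (repeat)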

The set $\mathcal{S}$ is not required by the algorithm and is introduced only to identify the iterates
for which a sufficient merit function reduction was obtained. Note that at least a third of
the iterates have their indices in $\mathcal{S}$. By using this fact, one can show that various constrained
optimization methods that use the watchdog technique are globally convergent. One can
also show that for all sufficiently large $k$, the step length is $\alpha_k = 1$ and the convergence rate
is superlinear.
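
The control flow of the watchdog cycle can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `watchdog`, `backtrack`, `phi`, `dphi`, and `step` are hypothetical, the line search is plain Armijo backtracking, and a smooth convex test function with Newton steps stands in for a real merit function and step computation.

```python
import numpy as np

def backtrack(phi, x, p, d, eta, shrink=0.5, max_halvings=50):
    """Armijo backtracking: find alpha with phi(x + alpha*p) <= phi(x) + eta*alpha*d."""
    alpha = 1.0
    for _ in range(max_halvings):
        if phi(x + alpha * p) <= phi(x) + eta * alpha * d:
            break
        alpha *= shrink
    return alpha

def watchdog(phi, dphi, step, x0, eta=0.1, tol=1e-10, max_cycles=200):
    """One relaxed step is tolerated before sufficient decrease is enforced."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_cycles):
        p = step(x)
        if np.linalg.norm(p) <= tol:          # simple termination test (an assumption)
            break
        d = dphi(x, p)                        # directional derivative D(phi(x); p)
        x1 = x + p                            # relaxed (full) step
        if phi(x1) <= phi(x) + eta * d:
            x = x1                            # k <- k + 1
            continue
        p1 = step(x1)
        a1 = backtrack(phi, x1, p1, dphi(x1, p1), eta)
        x2 = x1 + a1 * p1
        if phi(x1) <= phi(x) or phi(x2) <= phi(x) + eta * d:
            x = x2                            # k <- k + 2
        elif phi(x2) > phi(x):
            a = backtrack(phi, x, p, d, eta)  # return to x_k and search along p_k
            x = x + a * p                     # k <- k + 3
        else:
            p2 = step(x2)
            a2 = backtrack(phi, x2, p2, dphi(x2, p2), eta)
            x = x2 + a2 * p2                  # k <- k + 3
    return x

# Toy stand-in for a merit function: smooth, strictly convex, minimizer at 0.
phi = lambda x: np.sum(x**4) + 0.5 * np.sum(x**2)
grad = lambda x: 4 * x**3 + x
dphi = lambda x, p: grad(x) @ p
newton_step = lambda x: -grad(x) / (12 * x**2 + 1)   # Hessian is diagonal here

x_min = watchdog(phi, dphi, newton_step, [2.0, -1.5])
```

On this toy problem the full step is normally accepted at once, which is the best case described in the text; the other branches only come into play when a full step fails the sufficient decrease test.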

In practice, it may be advantageous to allow increases in the merit function for more
than one iteration. Values of $\hat{t}$ such as 5 or 8 are typical. As this discussion indicates, careful
implementations of the watchdog technique have a certain degree of complexity, but
the added complexity is worthwhile because the approach has good practical performance.
A potential advantage of the watchdog technique over the second-order correction strategy
is that it may require fewer evaluations of the constraint functions. In the best case,
most of the steps will be full steps, and there will rarely be a need to return to an earlier
point.


NOTES  AND  REFERENCES 

Techniques for eliminating linear constraints are described, for example, in Fletcher
[101] and Gill, Murray, and Wright [131]. For a thorough discussion of merit functions see
Boggs and Tolle [33] and Conn, Gould, and Toint [74]. Some of the earliest references on
nonmonotone methods include Grippo, Lampariello, and Lucidi [158], and Chamberlain et
al. [57]; see [74] for a review of nonmonotone techniques and an extensive list of references.
The concept of a filter was introduced by Fletcher and Leyffer [105]; our discussion of
filters is based on that paper. Second-order correction steps are motivated and discussed in
Fletcher [101].


Exercises

15.1 In Example 15.1, consider these three choices of the working set: $\mathcal{W} = \{3\}$,
$\mathcal{W} = \{1, 2\}$, $\mathcal{W} = \{2, 3\}$. Show that none of these working sets are the optimal active set for
(15.5).

15.2 For the problem in Example 15.3, perform simple elimination of the variables
$x_2$ and $x_5$ to obtain an unconstrained problem in the remaining variables $x_1$, $x_3$, $x_4$, and
$x_6$. Similarly to (15.12), express the eliminated variables explicitly in terms of the retained
variables.

15.3 Do the following problems have solutions? Explain.

$\min \; x_1 + x_2$ subject to $x_1^2 + x_2^2 = 2$, $0 \le x_1 \le 1$, $0 \le x_2 \le 1$;

$\min \; x_1 + x_2$ subject to $x_1^2 + x_2^2 \le 1$, $x_1 + x_2 = 3$;

$\min \; x_1 x_2$ subject to $x_1 + x_2 = 2$.

15.4 Show that if in Example 15.2 we eliminate $x$ in terms of $y$, then the correct
solution of the problem is obtained by performing unconstrained minimization.

15.5 Show that the basis matrices (15.15) are linearly independent.

15.6 Show that the particular solution $Q_1 R^{-T} \Pi^T b$ of $Ax = b$ is identical to
$A^T (A A^T)^{-1} b$.

15.7 In this exercise we compute basis matrices that attempt to compromise between
the orthonormal basis (15.22) and simple elimination (15.15). We assume that the basis
matrix is given by the first $m$ columns of $A$, so that $P = I$ in (15.7), and define

\[
Y = \begin{bmatrix} I \\ (B^{-1}N)^T \end{bmatrix}, \qquad
Z = \begin{bmatrix} -B^{-1}N \\ I \end{bmatrix}. \tag{15.37}
\]

(a) Show that the columns of $Y$ and $Z$ are no longer of norm 1 and that the relations
$AZ = 0$ and $Y^T Z = 0$ hold. Therefore, the columns of $Y$ and $Z$ form a linearly
independent set, showing that (15.37) is a valid choice of the basis matrices.

(b) Show that the particular solution $Y(AY)^{-1} b$ defined by this choice of $Y$ is, as in the
orthogonal factorization approach, the minimum-norm solution of $Ax = b$. More
specifically, show that

\[
Y(AY)^{-1} = A^T (A A^T)^{-1}.
\]

It follows that the matrix $Y(AY)^{-1}$ is independent of the choice of basis matrix $B$ in
(15.7), and its conditioning is determined by that of $A$ alone. (Note, however, that the
matrix $Z$ still depends explicitly on $B$, so a careful choice of $B$ is needed to ensure well
conditioning in this part of the computation.)

15.8 Verify that by adding the inequality constraint $3x_1 + 2x_3 \ge 1$ to the problem
(15.11), the elimination (15.12) transforms the problem into one of minimizing the function
(15.13) subject to the inequality constraint (15.23).


Chapter 16

Quadratic Programming

An  optimization  problem  with  a  quadratic  objective  function  and  linear  constraints  is  called 
a  quadratic  program.  Problems  of  this  type  are  important  in  their  own  right,  and  they  also 
arise  as  subproblems  in  methods  for  general  constrained  optimization,  such  as  sequential 
quadratic  programming  (Chapter  18),  augmented  Lagrangian  methods  (Chapter  17),  and 
interior-point methods (Chapter 19).

The general quadratic program (QP) can be stated as

\[
\begin{aligned}
\min_{x} \quad & q(x) = \tfrac{1}{2} x^T G x + x^T c && \text{(16.1a)} \\
\text{subject to} \quad & a_i^T x = b_i, \quad i \in \mathcal{E}, && \text{(16.1b)} \\
& a_i^T x \ge b_i, \quad i \in \mathcal{I}, && \text{(16.1c)}
\end{aligned}
\]

where $G$ is a symmetric $n \times n$ matrix, $\mathcal{E}$ and $\mathcal{I}$ are finite sets of indices, and $c$, $x$, and
$\{a_i\}$, $i \in \mathcal{E} \cup \mathcal{I}$, are vectors in $\mathbb{R}^n$. Quadratic programs can always be solved (or shown
to be infeasible) in a finite amount of computation, but the effort required to find a
solution depends strongly on the characteristics of the objective function and the number
of inequality constraints. If the Hessian matrix $G$ is positive semidefinite, we say that (16.1)
is a convex QP, and in this case the problem is often similar in difficulty to a linear program.
(Strictly convex QPs are those in which $G$ is positive definite.) Nonconvex QPs, in which $G$
is an indefinite matrix, can be more challenging because they can have several stationary
points and local minima.
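
The distinction between convex, strictly convex, and nonconvex QPs can be checked numerically from the eigenvalues of $G$. The following sketch assumes $G$ is symmetric; the function name `classify_qp` and the tolerance are illustrative choices, not part of the text.

```python
import numpy as np

def classify_qp(G, tol=1e-12):
    """Classify a QP by the smallest eigenvalue of its (symmetric) Hessian G."""
    eigs = np.linalg.eigvalsh(np.asarray(G, dtype=float))  # ascending order
    if eigs[0] > tol:
        return "strictly convex"   # G positive definite
    if eigs[0] >= -tol:
        return "convex"            # G positive semidefinite
    return "nonconvex"             # G has a negative eigenvalue

kinds = [
    classify_qp([[6, 2, 1], [2, 5, 2], [1, 2, 4]]),
    classify_qp([[1, 0], [0, 0]]),
    classify_qp([[1, 0], [0, -1]]),
]
```

For large sparse problems one would normally attempt a Cholesky factorization instead of computing all eigenvalues; the eigenvalue test is used here only because it is the most direct statement of the definition.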

In  this  chapter  we  focus  primarily  on  convex  quadratic  programs.  We  start  by 
considering  an  interesting  application  of  quadratic  programming. 


□  Example  16.1  (Portfolio  Optimization) 


Every investor knows that there is a tradeoff between risk and return: To increase the
expected return on investment, an investor must be willing to tolerate greater risks. Portfolio
theory studies how to model this tradeoff given a collection of $n$ possible investments with
returns $r_i$, $i = 1, 2, \dots, n$. The returns $r_i$ are usually not known in advance and are often
assumed to be random variables that follow a normal distribution. We can characterize these
variables by their expected value $\mu_i = E[r_i]$ and their variance $\sigma_i^2 = E[(r_i - \mu_i)^2]$. The
variance measures the fluctuations of the variable $r_i$ about its mean, so that larger values
of $\sigma_i$ indicate riskier investments. The returns are not in general independent, and we can
define correlations between pairs of returns as follows:

\[
\rho_{ij} = \frac{E[(r_i - \mu_i)(r_j - \mu_j)]}{\sigma_i \sigma_j}, \qquad \text{for } i, j = 1, 2, \dots, n.
\]

The correlation measures the tendency of the return on investments $i$ and $j$ to move in the
same direction. Two investments whose returns tend to rise and fall together have a positive
correlation; the nearer $\rho_{ij}$ is to 1, the more closely the two investments track each other.
Investments whose returns tend to move in opposite directions have a negative correlation.

An investor constructs a portfolio by putting a fraction $x_i$ of the available funds into
investment $i$, for $i = 1, 2, \dots, n$. Assuming that all available funds are invested and that
short-selling is not allowed, the constraints are $\sum_{i=1}^n x_i = 1$ and $x \ge 0$. The return on the
portfolio is given by

\[
R = \sum_{i=1}^n x_i r_i. \tag{16.2}
\]

To measure the desirability of the portfolio, we need to obtain measures of its expected
return and variance. The expected return is simply

\[
E[R] = E\left[ \sum_{i=1}^n x_i r_i \right] = \sum_{i=1}^n x_i E[r_i] = x^T \mu,
\]

while the variance is given by

\[
\mathrm{Var}[R] = E[(R - E[R])^2] = \sum_{i=1}^n \sum_{j=1}^n x_i x_j \sigma_i \sigma_j \rho_{ij} = x^T G x,
\]

where the $n \times n$ symmetric positive semidefinite matrix $G$ defined by

\[
G_{ij} = \rho_{ij} \sigma_i \sigma_j
\]

is called the covariance matrix.

Ideally, we would like to find a portfolio for which the expected return $x^T \mu$ is large
while the variance $x^T G x$ is small. In the model proposed by Markowitz [201], we combine
these two aims into a single objective function with the aid of a "risk tolerance parameter"
denoted by $\kappa$, and we solve the following problem to find the optimal portfolio:

\[
\max \; x^T \mu - \kappa x^T G x, \qquad \text{subject to } \sum_{i=1}^n x_i = 1, \; x \ge 0.
\]

The value chosen for the nonnegative parameter $\kappa$ depends on the preferences of the
individual investor. Conservative investors, who place more emphasis on minimizing risk
in their portfolio, would choose a large value of $\kappa$ to increase the weight of the variance
measure in the objective function. More daring investors, who are prepared to take on more
risk in the hope of a higher expected return, would choose a smaller value of $\kappa$.
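
To make the roles of $\mu$, $G$, and $\kappa$ concrete, the sketch below builds a covariance matrix from standard deviations and correlations and evaluates the Markowitz objective for one feasible portfolio. All numerical values and the name `markowitz_objective` are made up for illustration; no optimization is performed here.

```python
import numpy as np

# Hypothetical data for three investments (illustrative values only).
mu = np.array([0.08, 0.12, 0.05])          # expected returns mu_i
sigma = np.array([0.10, 0.25, 0.03])       # standard deviations sigma_i
rho = np.array([[1.0, 0.3, -0.2],
                [0.3, 1.0, 0.0],
                [-0.2, 0.0, 1.0]])         # correlations rho_ij

# Covariance matrix G_ij = rho_ij * sigma_i * sigma_j.
G = rho * np.outer(sigma, sigma)

def markowitz_objective(x, kappa):
    """Expected return minus kappa times variance: x^T mu - kappa x^T G x."""
    return mu @ x - kappa * (x @ G @ x)

x = np.array([0.5, 0.2, 0.3])              # feasible: sums to 1, nonnegative
expected_return = mu @ x
variance = x @ G @ x
cautious = markowitz_objective(x, kappa=10.0)
daring = markowitz_objective(x, kappa=0.5)
```

For a fixed portfolio, a larger $\kappa$ penalizes the same variance more heavily, so `cautious` is smaller than `daring`; in the optimization problem this pushes the solution toward lower-variance holdings.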

The  difficulty  in  applying  this  portfolio  optimization  technique  to  real-life  investing 
lies  in  defining  the  expected  returns,  variances,  and  correlations  for  the  investments  in 
question.  Financial  professionals  often  combine  historical  data  with  their  own  insights  and 
expectations  to  produce  values  of  these  quantities. 




16.1  EQUALITY-CONSTRAINED  QUADRATIC  PROGRAMS 


We  begin  our  discussion  of  algorithms  for  quadratic  programming  by  considering  the 
case  in  which  only  equality  constraints  are  present.  Techniques  for  this  special  case  are 
applicable  also  to  problems  with  inequality  constraints  since,  as  we  see  later  in  this  chapter, 
some  algorithms  for  general  QP  require  the  solution  of  an  equality-constrained  QP  at  each 
iteration. 

PROPERTIES  OF  EQUALITY-CONSTRAINED  QPs 

For  simplicity,  we  write  the  equality  constraints  in  matrix  form  and  state  the  equality- 
constrained  QP  as  follows: 


\[
\begin{aligned}
\min_{x} \quad & q(x) = \tfrac{1}{2} x^T G x + x^T c && \text{(16.3a)} \\
\text{subject to} \quad & Ax = b, && \text{(16.3b)}
\end{aligned}
\]

where $A$ is the $m \times n$ Jacobian of constraints (with $m \le n$) whose rows are $a_i^T$, $i \in \mathcal{E}$, and
$b$ is the vector in $\mathbb{R}^m$ whose components are $b_i$, $i \in \mathcal{E}$. For the present, we assume that $A$
has full row rank (rank $m$) so that the constraints (16.3b) are consistent. (In Section 16.8
we discuss the case in which $A$ is rank deficient.)

The first-order necessary conditions for $x^*$ to be a solution of (16.3) state that there is
a vector $\lambda^*$ such that the following system of equations is satisfied:

\[
\begin{bmatrix} G & -A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} x^* \\ \lambda^* \end{bmatrix}
=
\begin{bmatrix} -c \\ b \end{bmatrix}. \tag{16.4}
\]

These conditions are a consequence of the general result for first-order optimality conditions,
Theorem 12.1. As in Chapter 12, we call $\lambda^*$ the vector of Lagrange multipliers. The system
(16.4) can be rewritten in a form that is useful for computation by expressing $x^*$ as $x^* =
x + p$, where $x$ is some estimate of the solution and $p$ is the desired step. By introducing
this notation and rearranging the equations, we obtain

\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} -p \\ \lambda^* \end{bmatrix}
=
\begin{bmatrix} g \\ h \end{bmatrix}, \tag{16.5}
\]

where

\[
h = Ax - b, \qquad g = c + Gx, \qquad p = x^* - x. \tag{16.6}
\]

The matrix in (16.5) is called the Karush-Kuhn-Tucker (KKT) matrix, and the following
result gives conditions under which it is nonsingular. As in Chapter 15, we use $Z$ to
denote the $n \times (n - m)$ matrix whose columns are a basis for the null space of $A$. That is,
$Z$ has full rank and satisfies $AZ = 0$.

Lemma  16.1. 

Let $A$ have full row rank, and assume that the reduced-Hessian matrix $Z^T G Z$ is positive
definite. Then the KKT matrix

\[
K = \begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix} \tag{16.7}
\]

is nonsingular, and hence there is a unique vector pair $(x^*, \lambda^*)$ satisfying (16.4).

PROOF. Suppose there are vectors $w$ and $v$ such that

\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} w \\ v \end{bmatrix} = 0. \tag{16.8}
\]

Since $Aw = 0$, we have from (16.8) that

\[
0 = \begin{bmatrix} w \\ v \end{bmatrix}^T
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} w \\ v \end{bmatrix} = w^T G w.
\]

Since $w$ lies in the null space of $A$, it can be written as $w = Zu$ for some vector $u \in \mathbb{R}^{n-m}$.
Therefore, we have

\[
0 = w^T G w = u^T Z^T G Z u,
\]

which by positive definiteness of $Z^T G Z$ implies that $u = 0$. Therefore, $w = 0$, and by
(16.8), $A^T v = 0$. Full row rank of $A$ then implies that $v = 0$. We conclude that equation
(16.8) is satisfied only if $w = 0$ and $v = 0$, so the matrix is nonsingular, as claimed. □


□ Example 16.2

Consider the quadratic programming problem

\[
\begin{aligned}
\min \quad & q(x) = 3x_1^2 + 2x_1x_2 + x_1x_3 + 2.5x_2^2 + 2x_2x_3 + 2x_3^2 - 8x_1 - 3x_2 - 3x_3, \\
\text{subject to} \quad & x_1 + x_3 = 3, \quad x_2 + x_3 = 0.
\end{aligned} \tag{16.9}
\]

We can write this problem in the form (16.3) by defining

\[
G = \begin{bmatrix} 6 & 2 & 1 \\ 2 & 5 & 2 \\ 1 & 2 & 4 \end{bmatrix}, \quad
c = \begin{bmatrix} -8 \\ -3 \\ -3 \end{bmatrix}, \quad
A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}, \quad
b = \begin{bmatrix} 3 \\ 0 \end{bmatrix}.
\]

The solution $x^*$ and optimal Lagrange multiplier vector $\lambda^*$ are given by

\[
x^* = (2, -1, 1)^T, \qquad \lambda^* = (3, -2)^T.
\]

In this example, the matrix $G$ is positive definite, and the null-space basis matrix can be
defined as in (15.15), giving

\[
Z = (-1, -1, 1)^T. \tag{16.10}
\]

□
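
The solution quoted in Example 16.2 can be checked by assembling the first-order conditions, $Gx^* - A^T\lambda^* = -c$ and $Ax^* = b$, as one linear system and solving it with a dense solver. This is only a verification sketch using NumPy; variable names such as `x_star` and `lam_star` are illustrative.

```python
import numpy as np

# Data of Example 16.2.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])
n, m = G.shape[0], A.shape[0]

# Block system [[G, -A^T], [A, 0]] [x*; lambda*] = [-c; b].
K = np.block([[G, -A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])

sol = np.linalg.solve(K, rhs)
x_star, lam_star = sol[:n], sol[n:]
```

A general dense solve ignores the symmetry and block structure of the system; it is used here only because the problem has five unknowns.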


We have seen that when the conditions of Lemma 16.1 are satisfied, there is a unique
vector pair $(x^*, \lambda^*)$ that satisfies the first-order necessary conditions for (16.3). In fact, the
second-order sufficient conditions (see Theorem 12.6) are also satisfied at $(x^*, \lambda^*)$, so $x^*$ is
a strict local minimizer of (16.3). Moreover, a direct argument shows that $x^*$ is a
global solution of (16.3).

Theorem  16.2. 

Let $A$ have full row rank and assume that the reduced-Hessian matrix $Z^T G Z$ is positive
definite. Then the vector $x^*$ satisfying (16.4) is the unique global solution of (16.3).

PROOF. Let $x$ be any other feasible point (satisfying $Ax = b$), and as before, let $p$ denote
the difference $x^* - x$. Since $Ax^* = Ax = b$, we have that $Ap = 0$. By substituting into the
objective function (16.3a), we obtain

\[
\begin{aligned}
q(x) &= \tfrac{1}{2}(x^* - p)^T G (x^* - p) + c^T (x^* - p) \\
     &= \tfrac{1}{2} p^T G p - p^T G x^* - c^T p + q(x^*).
\end{aligned} \tag{16.11}
\]

From (16.4) we have that $Gx^* = -c + A^T \lambda^*$, so from $Ap = 0$ we have that

\[
p^T G x^* = p^T(-c + A^T \lambda^*) = -p^T c.
\]

By substituting this relation into (16.11), we obtain

\[
q(x) = \tfrac{1}{2} p^T G p + q(x^*).
\]

Since $p$ lies in the null space of $A$, we can write $p = Zu$ for some vector $u \in \mathbb{R}^{n-m}$, so that

\[
q(x) = \tfrac{1}{2} u^T Z^T G Z u + q(x^*).
\]

By positive definiteness of $Z^T G Z$, we conclude that $q(x) > q(x^*)$ except when $u = 0$, that
is, when $x = x^*$. Therefore, $x^*$ is the unique global solution of (16.3). □
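
The identity at the heart of this proof, $q(x) = \tfrac{1}{2} u^T Z^T G Z u + q(x^*)$ for every feasible $x = x^* - Zu$, can be checked numerically on the data of Example 16.2. The sketch below is a sanity check with arbitrarily chosen values of $u$, not a proof; the names `q`, `reduced`, and `gaps` are illustrative.

```python
import numpy as np

# Data of Example 16.2; Z spans the null space of A, so x* - u*Z stays feasible.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
Z = np.array([-1.0, -1.0, 1.0])
x_star = np.array([2.0, -1.0, 1.0])

def q(x):
    """Objective q(x) = 0.5 x^T G x + c^T x."""
    return 0.5 * x @ G @ x + c @ x

reduced = Z @ G @ Z   # the 1x1 reduced Hessian Z^T G Z

# Objective gap q(x) - q(x*) at feasible points x = x* - u*Z.
gaps = {u: q(x_star - u * Z) - q(x_star) for u in (-2.0, -0.5, 0.0, 1.0, 3.0)}
```

Each gap equals $\tfrac{1}{2} u^2 Z^T G Z$, which is positive unless $u = 0$, in agreement with the theorem.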

When the reduced Hessian matrix $Z^T G Z$ is positive semidefinite with zero eigenvalues,
the vector $x^*$ satisfying (16.4) is a local minimizer but not a strict local minimizer. If
the reduced Hessian has negative eigenvalues, then $x^*$ is only a stationary point, not a local
minimizer.


16.2 DIRECT SOLUTION OF THE KKT SYSTEM


In this section we discuss efficient methods for solving the KKT system (16.5). The first
important observation is that if $m \ge 1$, the KKT matrix is always indefinite. We define the
inertia of a symmetric matrix $K$ to be the scalar triple that indicates the numbers $n_+$, $n_-$,
and $n_0$ of positive, negative, and zero eigenvalues, respectively, that is,

\[
\mathrm{inertia}(K) = (n_+, n_-, n_0).
\]

The following result characterizes the inertia of the KKT matrix.

Theorem 16.3.

Let $K$ be defined by (16.7), and suppose that $A$ has rank $m$. Then

\[
\mathrm{inertia}(K) = \mathrm{inertia}(Z^T G Z) + (m, m, 0).
\]

Therefore, if $Z^T G Z$ is positive definite, $\mathrm{inertia}(K) = (n, m, 0)$.

The proof of this result is given in [111], for example. Note that the assumptions of
this theorem are satisfied by Example 16.2. Hence, if we construct the $5 \times 5$ matrix $K$ using
the data of this example, we obtain $\mathrm{inertia}(K) = (3, 2, 0)$.
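
The inertia claimed for Example 16.2 can be confirmed by counting eigenvalue signs directly. The helper `inertia` below is a hypothetical name, and the eigenvalue-based count with a fixed tolerance is an assumption that is adequate for a small dense matrix; practical codes read the inertia from the block-diagonal factor of a symmetric indefinite factorization instead.

```python
import numpy as np

def inertia(K, tol=1e-10):
    """Return (n_plus, n_minus, n_zero) for a symmetric matrix K."""
    eigs = np.linalg.eigvalsh(K)
    n_plus = int(np.sum(eigs > tol))
    n_minus = int(np.sum(eigs < -tol))
    n_zero = int(np.sum(np.abs(eigs) <= tol))
    return (n_plus, n_minus, n_zero)

# Data of Example 16.2.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Z = np.array([[-1.0], [-1.0], [1.0]])
m = A.shape[0]

K = np.block([[G, A.T],
              [A, np.zeros((m, m))]])   # KKT matrix (16.7)

inertia_K = inertia(K)
inertia_reduced = inertia(Z.T @ G @ Z)
```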

Knowing  that  the  KKT  system  is  indefinite,  we  now  describe  the  main  direct  techniques 
used  to  solve  (16.5). 

FACTORING  THE  FULL  KKT  SYSTEM 

One option for solving (16.5) is to perform a triangular factorization on the full KKT
matrix and then perform backward and forward substitution with the triangular factors.
Because of indefiniteness, we cannot use the Cholesky factorization. We could use Gaussian
elimination with partial pivoting (or a sparse variant thereof) to obtain the $L$ and $U$ factors,
but this approach has the disadvantage that it ignores the symmetry.

The most effective strategy in this case is to use a symmetric indefinite factorization,
which we have discussed in Chapter 3 and the Appendix. For a general symmetric matrix
$K$, this factorization has the form

\[
P^T K P = L B L^T, \tag{16.12}
\]

where $P$ is a permutation matrix, $L$ is unit lower triangular, and $B$ is block-diagonal with
either $1 \times 1$ or $2 \times 2$ blocks. The symmetric permutations defined by the matrix $P$ are
introduced for numerical stability of the computation and, in the case of large sparse $K$,
for maintaining sparsity. The computational cost of the symmetric indefinite factorization
(16.12) is typically about half the cost of sparse Gaussian elimination.

To solve (16.5), we first compute the factorization (16.12) of the coefficient matrix.
We then perform the following sequence of operations to arrive at the solution:

  solve $Lz = P^T \begin{bmatrix} g \\ h \end{bmatrix}$ to obtain $z$;
  solve $B\hat{z} = z$ to obtain $\hat{z}$;
  solve $L^T \bar{z} = \hat{z}$ to obtain $\bar{z}$;
  set $\begin{bmatrix} -p \\ \lambda^* \end{bmatrix} = P\bar{z}$.

Since multiplications with the permutation matrices $P$ and $P^T$ can be performed by simply
rearranging vector components, they are inexpensive. Solution of the system $B\hat{z} = z$ entails
solving a number of small $1 \times 1$ and $2 \times 2$ systems, so the number of operations is a small
multiple of the system dimension $(m + n)$, again inexpensive. Triangular substitutions with
$L$ and $L^T$ are more costly. Their precise cost depends on the amount of sparsity, but is
usually significantly less than the cost of performing the factorization (16.12).

This approach of factoring the full $(n + m) \times (n + m)$ KKT matrix (16.7) is quite
effective on many problems. It may be expensive, however, when the heuristics for choosing
the permutation matrix $P$ are not able to maintain sparsity in the $L$ factor, so that $L$ becomes
much more dense than the original coefficient matrix.

SCHUR-COMPLEMENT METHOD

Assuming that $G$ is positive definite, we can multiply the first equation in (16.5) by
$AG^{-1}$ and then subtract the second equation to obtain a linear system in the vector $\lambda^*$ alone:

\[
(AG^{-1}A^T)\lambda^* = (AG^{-1}g - h). \tag{16.13}
\]

We solve this symmetric positive definite system for $\lambda^*$ and then recover $p$ from the first
equation in (16.5) by solving

\[
Gp = A^T \lambda^* - g. \tag{16.14}
\]

This approach requires us to perform operations with $G^{-1}$, as well as to compute the
factorization of the $m \times m$ matrix $AG^{-1}A^T$. Therefore, it is most useful when:

• $G$ is well conditioned and easy to invert (for instance, when $G$ is diagonal or
block-diagonal); or

• $G^{-1}$ is known explicitly through a quasi-Newton updating formula; or

• the number of equality constraints $m$ is small, so that the number of backsolves needed
to form the matrix $AG^{-1}A^T$ is not too large.

The name "Schur-complement method" derives from the fact that, by applying block
Gaussian elimination to (16.7) using $G$ as the pivot, we obtain the block upper triangular
system

\[
\begin{bmatrix} G & A^T \\ 0 & -AG^{-1}A^T \end{bmatrix}. \tag{16.15}
\]

In linear algebra terminology, the matrix $AG^{-1}A^T$ is the Schur complement of $G$ in the
matrix $K$ of (16.7). By applying this block elimination technique to the system (16.5), and
performing a block backsolve, we obtain (16.13), (16.14).
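
As an illustration, the two-stage solve (16.13), (16.14) can be applied to the data of Example 16.2, starting from the estimate $x = 0$. The sketch uses dense NumPy solves for brevity; a practical code would factor $G$ once (for instance by Cholesky) and reuse the factor for every solve with $G$. Variable names are illustrative.

```python
import numpy as np

# Data of Example 16.2.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

x = np.zeros(3)          # current estimate of the solution
g = c + G @ x            # g = c + Gx
h = A @ x - b            # h = Ax - b

Ginv_At = np.linalg.solve(G, A.T)   # G^{-1} A^T, one solve per constraint
Ginv_g = np.linalg.solve(G, g)      # G^{-1} g
S = A @ Ginv_At                     # Schur complement A G^{-1} A^T (m x m)

lam_star = np.linalg.solve(S, A @ Ginv_g - h)   # (16.13)
p = np.linalg.solve(G, A.T @ lam_star - g)      # (16.14)
x_star = x + p
```

The number of solves with $G$ needed to form `S` grows with $m$, which is why the method is most attractive when $m$ is small or $G$ is trivially invertible.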

We can use an approach like the Schur-complement method to derive an explicit
inverse formula for the KKT matrix in (16.5). This formula is

\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix}^{-1}
=
\begin{bmatrix} C & E \\ E^T & F \end{bmatrix}, \tag{16.16}
\]

with

\[
\begin{aligned}
C &= G^{-1} - G^{-1}A^T (AG^{-1}A^T)^{-1} AG^{-1}, \\
E &= G^{-1}A^T (AG^{-1}A^T)^{-1}, \\
F &= -(AG^{-1}A^T)^{-1}.
\end{aligned}
\]

The solution of (16.5) can be obtained by multiplying its right-hand side by this inverse
matrix. If we take advantage of common expressions, and group the terms appropriately,
we recover the approach (16.13), (16.14).
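
The block inverse formula can be checked numerically by forming $C$, $E$, and $F$ for the data of Example 16.2 and multiplying by the KKT matrix. Explicit inverses are formed here only to mirror the formula; they are not recommended for actual computation, and the variable names are illustrative.

```python
import numpy as np

# Data of Example 16.2.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
n, m = G.shape[0], A.shape[0]

G_inv = np.linalg.inv(G)
S_inv = np.linalg.inv(A @ G_inv @ A.T)   # (A G^{-1} A^T)^{-1}

C = G_inv - G_inv @ A.T @ S_inv @ A @ G_inv
E = G_inv @ A.T @ S_inv
F = -S_inv

K = np.block([[G, A.T],
              [A, np.zeros((m, m))]])
K_inv = np.block([[C, E],
                  [E.T, F]])

residual = K @ K_inv - np.eye(n + m)
```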




NULL-SPACE  METHOD 

The  null-space  method  does  not  require  nonsingularity  of  G  and  therefore  has  wider 
applicability  than  the  Schur-complement  method.  It  assumes  only  that  the  conditions  of 
Lemma  16.1  hold,  namely,  that  A  has  full  row  rank  and  that  ZT GZ  is  positive  definite. 
However,  it  requires  knowledge  of  the  null-space  basis  matrix  Z.  Like  the  Schur-complement 
method,  it  exploits  the  block  structure  in  the  KKT  system  to  decouple  (16.5)  into  two  smaller 
systems. 

Suppose that we partition the vector $p$ in (16.5) into two components, as follows:

\[
p = Y p_Y + Z p_Z, \tag{16.17}
\]

where $Z$ is the $n \times (n - m)$ null-space matrix, $Y$ is any $n \times m$ matrix such that $[Y \mid Z]$
is nonsingular, $p_Y$ is an $m$-vector, and $p_Z$ is an $(n - m)$-vector. The matrices $Y$ and $Z$
were discussed in Section 15.3, where Figure 15.4 shows that $Yx_Y$ is a particular solution of
$Ax = b$, while $Zx_Z$ is a displacement along these constraints.

By substituting $p$ into the second equation of (16.5) and recalling that $AZ = 0$, we
obtain

\[
(AY) p_Y = -h. \tag{16.18}
\]

Since $A$ has rank $m$ and $[Y \mid Z]$ is $n \times n$ nonsingular, the product $A[Y \mid Z] = [AY \mid 0]$ has
rank $m$. Therefore, $AY$ is a nonsingular $m \times m$ matrix, and $p_Y$ is well determined by the
equations (16.18). Meanwhile, we can substitute (16.17) into the first equation of (16.5) to
obtain

\[
-GY p_Y - GZ p_Z + A^T \lambda^* = g
\]

and multiply by $Z^T$ to obtain

\[
(Z^T G Z) p_Z = -Z^T G Y p_Y - Z^T g. \tag{16.19}
\]

This system can be solved by performing a Cholesky factorization of the reduced-Hessian
matrix $Z^T G Z$ to determine $p_Z$. We therefore can compute the total step $p = Y p_Y + Z p_Z$.
To obtain the Lagrange multiplier, we multiply the first block row in (16.5) by $Y^T$ to obtain
the linear system

\[
(AY)^T \lambda^* = Y^T (g + Gp), \tag{16.20}
\]


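which can be solved for $\lambda^*$.

□ Example 16.3

Consider the problem (16.9) given in Example 16.2. We can choose

\[
Y = \begin{bmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \\ 1/3 & 1/3 \end{bmatrix}
\]

and set $Z$ as in (16.10). Note that $AY = I$.

Suppose we have $x = (0, 0, 0)^T$ in (16.6). Then

\[
h = Ax - b = -b, \qquad g = c + Gx = c = \begin{bmatrix} -8 \\ -3 \\ -3 \end{bmatrix}.
\]

Simple calculation shows that

\[
p_Y = \begin{bmatrix} 3 \\ 0 \end{bmatrix}, \qquad p_Z = \begin{bmatrix} 0 \end{bmatrix},
\]

so that

\[
p = x^* - x = Y p_Y + Z p_Z = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}.
\]

After recovering $\lambda^*$ from (16.20), we conclude that

\[
\lambda^* = \begin{bmatrix} 3 \\ -2 \end{bmatrix}.
\]

□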
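
The computation in Example 16.3 can be reproduced step by step. The sketch below follows the null-space recipe: solve $(AY)p_Y = -h$, solve the reduced system for $p_Z$ using a Cholesky factor of $Z^T G Z$, assemble $p$, and recover $\lambda^*$ from $(AY)^T\lambda^* = Y^T(g + Gp)$. Dense NumPy calls are used for brevity and the variable names are illustrative.

```python
import numpy as np

# Data of Example 16.2 with Y and Z as in Example 16.3.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])
Y = np.array([[2/3, -1/3],
              [-1/3, 2/3],
              [1/3, 1/3]])
Z = np.array([[-1.0], [-1.0], [1.0]])

x = np.zeros(3)
g = c + G @ x
h = A @ x - b

# Range-space component: (AY) p_Y = -h.
p_Y = np.linalg.solve(A @ Y, -h)

# Null-space component: (Z^T G Z) p_Z = -Z^T G Y p_Y - Z^T g, via Cholesky.
L = np.linalg.cholesky(Z.T @ G @ Z)
rhs = -Z.T @ (G @ (Y @ p_Y)) - Z.T @ g
p_Z = np.linalg.solve(L.T, np.linalg.solve(L, rhs))

p = Y @ p_Y + Z @ p_Z

# Multipliers: (AY)^T lambda* = Y^T (g + G p).
lam_star = np.linalg.solve((A @ Y).T, Y.T @ (g + G @ p))
x_star = x + p
```

`np.linalg.solve` does not exploit the triangular structure of `L`; a dedicated triangular solver would be the natural choice in production code.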


The null-space approach can be very effective when the number of degrees of freedom
$n - m$ is small. Its main limitation lies in the need for the null-space matrix $Z$ which, as we
have seen in Chapter 15, can be expensive to compute in some large problems. The matrix $Z$
is not uniquely defined and, if it is poorly chosen, the reduced system (16.19) may become ill
conditioned. If we choose $Z$ to have orthonormal columns, as is normally done in software
for small and medium-sized problems, then the conditioning of $Z^T G Z$ is at least as good
as that of $G$ itself. When $A$ is large and sparse, however, an orthonormal $Z$ is expensive to
compute, so for practical reasons we are often forced to use one of the less reliable choices
of $Z$ described in Chapter 15.

It is difficult to give hard and fast rules about the relative effectiveness of null-space and
Schur-complement methods, because factors such as fill-in during computation of $Z$ vary
significantly even among problems of the same dimension. In general, we can recommend
the Schur-complement method if $G$ is positive definite and $AG^{-1}A^T$ can be computed
relatively cheaply (because $G$ is easy to invert or because $m$ is small relative to $n$). Otherwise,
the null-space method is often preferable, in particular when it is much more expensive to
compute factors of $G$ than to compute the null-space matrix $Z$ and the factors of $Z^T G Z$.


16.3 ITERATIVE SOLUTION OF THE KKT SYSTEM


An  alternative  to  the  direct  factorization  techniques  discussed  in  the  previous  section  is  to 
use  an  iterative  method  to  solve  the  KKT  system  (16.5).  Iterative  methods  are  suitable  for 
solving  very  large  systems  and  often  lend  themselves  well  to  parallelization.  The  conjugate 
gradient  (CG)  method  is  not  recommended  for  solving  the  full  system  (16.5)  as  written, 
because  it  can  be  unstable  on  systems  that  are  not  positive  definite.  Better  options  are 
Krylov  methods  for  general  linear  or  symmetric  indefinite  systems.  Candidates  include 
the  GMRES,  QMR,  and  LSQR  methods;  see  the  Notes  and  References  at  the  end  of  the 
chapter.  Other  iterative  methods  can  be  derived  from  the  null-space  approach  by  applying 
the  conjugate  gradient  method  to  the  reduced  system  (16.19).  Methods  of  this  type  are  key 
to  the  algorithms  of  Chapters  18  and  19,  and  are  discussed  in  the  remainder  of  this  section. 
We  assume  throughout  that  ZT GZ  is  positive  definite. 

CG  APPLIED  TO  THE  REDUCED  SYSTEM 

We  begin  our  discussion  of  iterative  null-space  methods  by  deriving  the  underlying 
equations  in  the  notation  of  the  equality-constrained  QP  (16.3).  Expressing  the  solution  of 
the  quadratic  program  (16.3)  as 


\[
x^* = Y x_Y + Z x_Z, \tag{16.21}
\]

for some vectors $x_Z \in \mathbb{R}^{n-m}$, $x_Y \in \mathbb{R}^m$, the constraints $Ax = b$ yield

\[
AY x_Y = b, \tag{16.22}
\]

which determines the vector $x_Y$. In Chapter 15, various practical choices of $Y$ are described,
some of which allow (16.22) to be solved economically. Substituting (16.21) into (16.3), we
see that $x_Z$ solves the unconstrained reduced problem

\[
\min_{x_Z} \; \tfrac{1}{2} x_Z^T Z^T G Z x_Z + x_Z^T c_Z,
\]

where

\[
c_Z = Z^T G Y x_Y + Z^T c. \tag{16.23}
\]

The solution $x_Z$ satisfies the linear system

\[
Z^T G Z x_Z = -c_Z. \tag{16.24}
\]

Since $Z^T G Z$ is positive definite, we can apply the CG method to this linear system and
substitute $x_Z$ into (16.21) to obtain a solution of (16.3).

As discussed in Chapter 5, preconditioning can improve the rate of convergence of
the CG iteration, so we assume that a preconditioner $W_{ZZ}$ is given. The preconditioned CG
method (Algorithm 5.3) applied to the $(n - m)$-dimensional reduced system (16.24) is as
follows. (We denote the steps produced by the CG iteration by $d_Z$.)

Algorithm 16.1 (Preconditioned CG for Reduced Systems).

  Choose an initial point $x_Z$;
  Compute $r_Z = Z^T G Z x_Z + c_Z$, $g_Z = W_{ZZ}^{-1} r_Z$, and $d_Z = -g_Z$;
  repeat
    $\alpha \leftarrow r_Z^T g_Z / d_Z^T Z^T G Z d_Z$;          (16.25a)
    $x_Z \leftarrow x_Z + \alpha d_Z$;                          (16.25b)
    $r_Z^+ \leftarrow r_Z + \alpha Z^T G Z d_Z$;                (16.25c)
    $g_Z^+ \leftarrow W_{ZZ}^{-1} r_Z^+$;                       (16.25d)
    $\beta \leftarrow (r_Z^+)^T g_Z^+ / r_Z^T g_Z$;             (16.25e)
    $d_Z \leftarrow -g_Z^+ + \beta d_Z$;                        (16.25f)
    $g_Z \leftarrow g_Z^+$; $r_Z \leftarrow r_Z^+$;             (16.25g)
  until a termination test is satisfied.

This iteration may be terminated when, for example, $r_Z^T W_{ZZ}^{-1} r_Z$ is sufficiently small.

In this approach, it is not necessary to form the reduced Hessian $Z^T G Z$ explicitly
because the CG method requires only that we compute matrix-vector products involving
this matrix. In fact, it is not even necessary to form $Z$ explicitly as long as we are able to
compute products of $Z$ and $Z^T$ with arbitrary vectors. For some choices of $Z$, these products
are much cheaper to compute than $Z$ itself, as we have seen in Chapter 15.
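The iteration of Algorithm 16.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production solver: the matrices `G` and `Z`, the vector `c_z`, and the diagonal (Jacobi) preconditioner `jacobi` are made-up test data, and the helper names are hypothetical. Plain lists are used only to keep the sketch dependency-free. Note that the reduced Hessian is never formed; only products with $Z$, $G$, and $Z^T$ are computed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Product M v for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def reduced_pcg(G, Z, c_z, w_solve, x_z, tol=1e-14, max_iter=50):
    """Sketch of Algorithm 16.1; w_solve(r) applies the inverse preconditioner."""
    Zt = [list(col) for col in zip(*Z)]

    def hess(v):
        # v -> Z^T (G (Z v)); the matrix Z^T G Z is never built
        return matvec(Zt, matvec(G, matvec(Z, v)))

    r = [h + ci for h, ci in zip(hess(x_z), c_z)]
    g = w_solve(r)
    d = [-gi for gi in g]
    for _ in range(max_iter):
        rg = dot(r, g)                                   # r_Z^T W_ZZ^{-1} r_Z
        if rg <= tol:
            break
        Hd = hess(d)
        alpha = rg / dot(d, Hd)                          # (16.25a)
        x_z = [xi + alpha * di for xi, di in zip(x_z, d)]  # (16.25b)
        r = [ri + alpha * hi for ri, hi in zip(r, Hd)]   # (16.25c)
        g = w_solve(r)                                   # (16.25d)
        beta = dot(r, g) / rg                            # (16.25e)
        d = [-gi + beta * di for gi, di in zip(g, d)]    # (16.25f)
    return x_z

# Illustrative data (assumed): A = [1 1 1], so the columns of Z span null(A).
G = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]
Z = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
c_z = [-2.0, -2.0]

def jacobi(r):
    # inverse of diag(Z^T G Z) = diag(8, 10) for this data
    return [r[0] / 8.0, r[1] / 10.0]

x_z = reduced_pcg(G, Z, c_z, jacobi, [0.0, 0.0])
```

For this data the reduced system is $\begin{bmatrix} 8 & 6 \\ 6 & 10 \end{bmatrix} x_Z = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$, whose solution is $x_Z = (2/11, 1/11)$.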

The preconditioner $W_{ZZ}$ is a symmetric, positive definite matrix of dimension $n - m$,
which might be chosen to cluster the eigenvalues of $W_{ZZ}^{-1/2} (Z^T G Z) W_{ZZ}^{-1/2}$ and to reduce
the span between the smallest and largest eigenvalues. An ideal choice of preconditioner is
one for which $W_{ZZ}^{-1/2} (Z^T G Z) W_{ZZ}^{-1/2} = I$, that is, $W_{ZZ} = Z^T G Z$. Motivated by this ideal,
we consider preconditioners of the form

$$ W_{ZZ} = Z^T H Z, \tag{16.26} $$

where $H$ is a symmetric matrix such that $Z^T H Z$ is positive definite. Some choices of $H$ are
discussed below. Preconditioners of the form (16.26) allow us to apply the CG method in
$n$-dimensional space, as we discuss next.


THE  PROJECTED  CG  METHOD 

It is possible to design a modification of Algorithm 16.1 that avoids operating with
the null-space basis $Z$, provided we use a preconditioner of the form (16.26) and a particular
solution of the equation $Ax = b$. This approach works implicitly with an orthogonal matrix
$Z$ and is not affected by ill conditioning in $A$ or by a poor choice of $Z$.

After the solution $x_Z$ of (16.24) has been computed by using Algorithm 16.1, it
must be multiplied by $Z$ and substituted in (16.21) to give the solution of the quadratic
program (16.3). Alternatively, we may rewrite Algorithm 16.1 to work directly with the
vector $x = Z x_Z + Y x_Y$, where the $Y x_Y$ term is fixed at the start and the $x_Z$ term is updated
(implicitly) within each iteration. To specify this form of the CG algorithm, we introduce
the $n$-vectors $x$, $r$, $g$, and $d$, which satisfy $x = Z x_Z + Y x_Y$, $Z^T r = r_Z$, $g = Z g_Z$, and $d = Z d_Z$,
respectively. We also define the scaled $n \times n$ projection matrix $P$ as follows:

$$ P = Z (Z^T H Z)^{-1} Z^T, \tag{16.27} $$

where $H$ is the preconditioning matrix from (16.26). The CG iteration in $n$-dimensional
space can be specified as follows.

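Algorithm 16.2 (Projected CG Method).

Choose an initial point $x$ satisfying $Ax = b$;
Compute $r = Gx + c$, $g = Pr$, and $d = -g$;
repeat
    $\alpha \leftarrow r^T g / d^T G d$;    (16.28a)
    $x \leftarrow x + \alpha d$;    (16.28b)
    $r^+ \leftarrow r + \alpha G d$;    (16.28c)
    $g^+ \leftarrow P r^+$;    (16.28d)
    $\beta \leftarrow (r^+)^T g^+ / r^T g$;    (16.28e)
    $d \leftarrow -g^+ + \beta d$;    (16.28f)
    $g \leftarrow g^+$; $r \leftarrow r^+$;    (16.28g)
until a convergence test is satisfied.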
A practical stop test is to terminate when $r^T g = r^T P r$ is smaller than a prescribed
tolerance.

Note that the vector $g^+$, which we call the preconditioned residual, has been defined to
be in the null space of $A$. As a result, in exact arithmetic, all the search directions $d$ generated
by Algorithm 16.2 also lie in the null space of $A$, and thus the iterates $x$ all satisfy $Ax = b$.
It is not difficult to verify (see Exercise 16.14) that the iteration is well defined if $Z^T G Z$
and $Z^T H Z$ are positive definite. The reader can also verify that the iterates $x$ generated by
Algorithm 16.2 are related to the iterates $x_Z$ of Algorithm 16.1 via (16.21).

Two simple choices of the preconditioning matrix $H$ are $H = \mathrm{diag}(|G_{ii}|)$ and $H = I$.
In some applications, it is effective to define $H$ as a block diagonal submatrix of $G$.

Algorithm 16.2 makes use of the null-space basis $Z$ only through the operator (16.27).
It is possible, however, to compute $Pr$ without knowing a representation of the null-space
basis $Z$. For simplicity, we first consider the case in which $H = I$, so that $P$ is the orthogonal
projection operator onto the null space of $A$. We use $P_I$ to denote this special case of $P$,
that is,

$$ P_I = Z (Z^T Z)^{-1} Z^T. \tag{16.29} $$

The computation of the preconditioned residual $g^+ = P_I r^+$ in (16.28d) can be performed
in two ways. The first is to express $P_I$ by the equivalent formula

$$ P_I = I - A^T (A A^T)^{-1} A \tag{16.30} $$

and thus compute $g^+ = P_I r^+$. We can then write $g^+ = r^+ - A^T v^+$, where $v^+$ is the solution
of the system

$$ A A^T v^+ = A r^+. \tag{16.31} $$

This approach for computing the projection $g^+ = P_I r^+$ is called the normal equations
approach; the system (16.31) can be solved by using a Cholesky factorization
of $A A^T$.
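Algorithm 16.2 with $H = I$ and the normal equations projection can be sketched as follows. The code is a hedged illustration: the helper names and the test data are assumptions, a small Gaussian elimination routine stands in for the Cholesky factorization of $A A^T$, and the starting point is taken to be $x = A^T (A A^T)^{-1} b$, which satisfies $Ax = b$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [bi] for row, bi in zip(M, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def projected_cg(G, c, A, b, tol=1e-14, max_iter=50):
    """Sketch of Algorithm 16.2 with P = P_I computed via (16.30)-(16.31)."""
    At = [list(col) for col in zip(*A)]
    AAt = [[dot(ai, aj) for aj in A] for ai in A]

    def project(r):
        v = solve(AAt, matvec(A, r))                     # (16.31)
        return [ri - wi for ri, wi in zip(r, matvec(At, v))]

    x = matvec(At, solve(AAt, b))                        # feasible start
    r = [gi + ci for gi, ci in zip(matvec(G, x), c)]
    g = project(r)
    d = [-gi for gi in g]
    for _ in range(max_iter):
        rg = dot(r, g)                                   # r^T P r
        if rg <= tol:
            break
        Gd = matvec(G, d)
        alpha = rg / dot(d, Gd)                          # (16.28a)
        x = [xi + alpha * di for xi, di in zip(x, d)]    # (16.28b)
        r = [ri + alpha * gi for ri, gi in zip(r, Gd)]   # (16.28c)
        g = project(r)                                   # (16.28d)
        beta = dot(r, g) / rg                            # (16.28e)
        d = [-gi + beta * di for gi, di in zip(g, d)]    # (16.28f)
    return x

# Illustrative data (assumed): minimize over the plane x1 + x2 + x3 = 1.
G = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]
c = [-1.0, -1.0, 1.0]
A = [[1.0, 1.0, 1.0]]
b = [1.0]
x = projected_cg(G, c, A, b)
```

Solving the KKT conditions of this small problem by hand gives $x = (8/11, 4/11, -1/11)$, which the iteration reproduces while every iterate stays on the constraint.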

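The second approach is to express the projection (16.28d) as the solution of the
augmented system

$$ \begin{bmatrix} I & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} g^+ \\ v^+ \end{bmatrix} = \begin{bmatrix} r^+ \\ 0 \end{bmatrix}, \tag{16.32} $$

which can be solved by means of a symmetric indefinite factorization, as discussed earlier.
We call this approach the augmented system approach.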
We suppose now that the preconditioning has the general form of (16.27) and (16.28d).
When $H$ is nonsingular, we can compute $g^+$ as follows:

$$ g^+ = P r^+, \quad \text{where} \quad P = H^{-1} \left( I - A^T (A H^{-1} A^T)^{-1} A H^{-1} \right). \tag{16.33} $$

Otherwise, when $z^T H z \ne 0$ for all nonzero $z$ with $Az = 0$, we can find $g^+$ as the solution
of the system

$$ \begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} g^+ \\ v^+ \end{bmatrix} = \begin{bmatrix} r^+ \\ 0 \end{bmatrix}. \tag{16.34} $$

While (16.33) is unappealing when $H$ does not have a simple form, (16.34) is a useful
generalization of (16.32). A "perfect" preconditioner is obtained by taking $H = G$, but
other choices for $H$ are also possible, provided that $Z^T H Z$ is positive definite. The matrix
in (16.34) is often called a constraint preconditioner.
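The preconditioned residual defined by (16.34) can be computed with a single linear solve, as in the following sketch. The function name `preconditioned_residual` and the data are hypothetical, and dense Gaussian elimination is used here only as a stand-in for the symmetric indefinite factorization that the text recommends.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [bi] for row, bi in zip(M, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def preconditioned_residual(H, A, r):
    """Return (g, v) solving [H A^T; A 0] [g; v] = [r; 0], as in (16.34)."""
    n, m = len(H), len(A)
    K = [list(H[i]) + [A[k][i] for k in range(m)] for i in range(n)]
    K += [list(A[k]) + [0.0] * m for k in range(m)]
    sol = solve(K, list(r) + [0.0] * m)
    return sol[:n], sol[n:]

# Illustrative data (assumed)
A = [[1.0, 1.0, 1.0]]
r = [1.0, 2.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]

g_I, v_I = preconditioned_residual(identity, A, r)   # special case (16.32)
g_H, v_H = preconditioned_residual(H, A, r)          # general case (16.34)
```

With $H = I$ the result is the orthogonal projection of $r$ onto the null space of $A$, here $(-1, 0, 1)$. With the diagonal $H$ above, hand calculation gives $g^+ = (-7/22, 1/11, 5/22)$. In both cases $A g^+ = 0$.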

None of these procedures for computing the projection makes use of a null-space
basis $Z$; only the factorization of matrices involving $A$ is required. Significantly, all these
forms allow us to compute an initial point satisfying $Ax = b$. The operator $g^+ = P_I r^+$
relies on a factorization of $A A^T$ from which we can compute $x = A^T (A A^T)^{-1} b$, while
factorizations of the system matrices in (16.32) and (16.34) allow us to find a suitable $x$ by
solving

$$ \begin{bmatrix} I & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ b \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ b \end{bmatrix}. $$

Therefore we can compute an initial point for Algorithm 16.2 at the cost of one backsolve,
using the factorization of the system needed to perform the projection operators.

We point out that these approaches for computing $g^+$ can give rise to significant
round-off errors, so the use of iterative refinement is recommended to improve
accuracy.


16.4  INEQUALITY-CONSTRAINED  PROBLEMS 


In  the  remainder  of  the  chapter  we  discuss  several  classes  of  algorithms  for  solving  convex 
quadratic  programs  that  contain  both  inequality  and  equality  constraints.  Active-set  methods 
have  been  widely  used  since  the  1970s  and  are  effective  for  small-  and  medium-sized 
problems.  They  allow  for  efficient  detection  of  unboundedness  and  infeasibility  and  typically 
return  an  accurate  estimate  of  the  optimal  active  set.  Interior-point  methods  are  more  recent, 
having  become  popular  in  the  1990s.  They  are  well  suited  for  large  problems  but  may  not 
be the most effective when a series of related QPs must be solved. We also study a special
type of active-set method called the gradient projection method, which is most effective when
the only constraints in the problem are bounds on the variables.


OPTIMALITY  CONDITIONS  FOR  INEQUALITY-CONSTRAINED  PROBLEMS 

We begin our discussion with a brief review of the optimality conditions for
inequality-constrained quadratic programming, then discuss some of the less obvious properties of the
solutions.

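Theorem 12.1 can be applied to (16.1) by noting that the Lagrangian for this problem
is

$$ \mathcal{L}(x, \lambda) = \tfrac{1}{2} x^T G x + x^T c - \sum_{i \in \mathcal{I} \cup \mathcal{E}} \lambda_i (a_i^T x - b_i). \tag{16.35} $$

As in Definition 12.1, the active set $\mathcal{A}(x^*)$ consists of the indices of the constraints for which
equality holds at $x^*$:

$$ \mathcal{A}(x^*) = \{ i \in \mathcal{E} \cup \mathcal{I} \mid a_i^T x^* = b_i \}. \tag{16.36} $$

By specializing the KKT conditions (12.34) to this problem, we find that any solution $x^*$
of (16.1) satisfies the following first-order conditions, for some Lagrange multipliers $\lambda_i^*$,
$i \in \mathcal{A}(x^*)$:

$$ G x^* + c - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* a_i = 0, \tag{16.37a} $$
$$ a_i^T x^* = b_i, \quad \text{for all } i \in \mathcal{A}(x^*), \tag{16.37b} $$
$$ a_i^T x^* \ge b_i, \quad \text{for all } i \in \mathcal{I} \setminus \mathcal{A}(x^*), \tag{16.37c} $$
$$ \lambda_i^* \ge 0, \quad \text{for all } i \in \mathcal{I} \cap \mathcal{A}(x^*). \tag{16.37d} $$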

A technical point: In Theorem 12.1 we assumed that the linear independence constraint
qualification (LICQ) was satisfied. As mentioned in Section 12.6, this theorem still
holds if we replace LICQ by other constraint qualifications, such as linearity of the constraints,
which is certainly satisfied for quadratic programming. Hence, in the optimality
conditions for quadratic programming given above, we need not assume that the active
constraints are linearly independent at the solution.

For convex QP, when $G$ is positive semidefinite, the conditions (16.37) are in fact
sufficient for $x^*$ to be a global solution, as we now prove.

Theorem 16.4.

If $x^*$ satisfies the conditions (16.37) for some $\lambda_i^*$, $i \in \mathcal{A}(x^*)$, and $G$ is positive semidefinite,
then $x^*$ is a global solution of (16.1).
PROOF. If $x$ is any other feasible point for (16.1), we have that $a_i^T x = b_i$ for all $i \in \mathcal{E}$ and
$a_i^T x \ge b_i$ for all $i \in \mathcal{A}(x^*) \cap \mathcal{I}$. Hence, $a_i^T (x - x^*) = 0$ for all $i \in \mathcal{E}$ and $a_i^T (x - x^*) \ge 0$
for all $i \in \mathcal{A}(x^*) \cap \mathcal{I}$. Using these relationships, together with (16.37a) and (16.37d), we
have that

$$ (x - x^*)^T (G x^* + c) = \sum_{i \in \mathcal{E}} \lambda_i^* a_i^T (x - x^*) + \sum_{i \in \mathcal{A}(x^*) \cap \mathcal{I}} \lambda_i^* a_i^T (x - x^*) \ge 0. \tag{16.38} $$

By elementary manipulation, we find that

$$ q(x) = q(x^*) + (x - x^*)^T (G x^* + c) + \tfrac{1}{2} (x - x^*)^T G (x - x^*) $$
$$ \ge q(x^*) + \tfrac{1}{2} (x - x^*)^T G (x - x^*) $$
$$ \ge q(x^*), $$

where the first inequality follows from (16.38) and the second inequality follows from
positive semidefiniteness of $G$. We have shown that $q(x) \ge q(x^*)$ for any feasible $x$, so $x^*$
is a global solution. □
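For readers who want the elementary manipulation spelled out, the expansion below is one way to obtain it. It assumes only the definition $q(x) = \tfrac{1}{2} x^T G x + x^T c$ and the symmetry of $G$, and writes $x = x^* + (x - x^*)$:

```latex
\begin{aligned}
q(x) &= \tfrac{1}{2}\bigl(x^* + (x - x^*)\bigr)^T G \bigl(x^* + (x - x^*)\bigr)
        + \bigl(x^* + (x - x^*)\bigr)^T c \\
     &= \underbrace{\tfrac{1}{2} (x^*)^T G x^* + (x^*)^T c}_{q(x^*)}
        + (x - x^*)^T G x^* + (x - x^*)^T c
        + \tfrac{1}{2} (x - x^*)^T G (x - x^*) \\
     &= q(x^*) + (x - x^*)^T (G x^* + c)
        + \tfrac{1}{2} (x - x^*)^T G (x - x^*).
\end{aligned}
```

The two cross terms $\tfrac{1}{2}(x^*)^T G (x - x^*)$ and $\tfrac{1}{2}(x - x^*)^T G x^*$ are equal because $G$ is symmetric, which is why they combine into the single term $(x - x^*)^T G x^*$.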

By a trivial modification of this proof, we see that $x^*$ is actually the unique global
solution when $G$ is positive definite.

We can also apply the theory from Section 12.5 to derive second-order optimality
conditions for (16.1). Second-order sufficient conditions for $x^*$ to be a local minimizer are
satisfied if $Z^T G Z$ is positive definite, where $Z$ is defined to be a null-space basis matrix
for the active constraint Jacobian matrix, which is the matrix whose rows are $a_i^T$ for all
$i \in \mathcal{A}(x^*)$. In this case, $x^*$ is a strict local solution, according to Theorem 12.6.

When $G$ is not positive definite, the general problem (16.1) may have more than one
strict local solution. As mentioned above, such problems are called "nonconvex QPs" or
"indefinite QPs," and they cause some complications for algorithms. Examples of indefinite
QPs are illustrated in Figure 16.1. On the left we have plotted the feasible region and
the contours of a quadratic objective $q(x)$ in which $G$ has one positive and one negative
eigenvalue. We have indicated by $+$ or $-$ that the function tends toward plus or minus
infinity in that direction. Note that $x^{**}$ is a local maximizer, $x^*$ a local minimizer, and the
center of the box is a stationary point. The picture on the right in Figure 16.1, in which both
eigenvalues of $G$ are negative, shows a global maximizer at $\tilde{x}$ and local minimizers at $x^*$ and
$x^{**}$.


DEGENERACY 

A second property that causes difficulties for some algorithms is degeneracy.
Confusingly, this term has been given a variety of meanings. It refers to situations in
which

(a) the active constraint gradients $a_i$, $i \in \mathcal{A}(x^*)$, are linearly dependent at the solution $x^*$,
and/or

(b) the strict complementarity condition of Definition 12.5 fails to hold, that is, there is
some index $i \in \mathcal{A}(x^*)$ such that all Lagrange multipliers satisfying (16.37) have $\lambda_i^* = 0$.
(Such constraints are weakly active according to Definition 12.8.)

Figure 16.1 Nonconvex quadratic programs.

Figure 16.2 Degenerate solutions of quadratic programs.


Two examples of degeneracy are shown in Figure 16.2. In the left-hand picture, there
is a single active constraint at the solution $x^*$, which is also an unconstrained minimizer
of the objective function. In the notation of (16.37a), we have that $G x^* + c = 0$, so that
the lone Lagrange multiplier must be zero. In the right-hand picture, three constraints are
active at the solution $x^*$. Since each of the three constraint gradients is a vector in $\mathbb{R}^2$, they
must be linearly dependent.

Lack of strict complementarity is also illustrated by the problem

$$ \min \; x_1^2 + (x_2 + 1)^2 \quad \text{subject to } x \ge 0, $$

which has a solution at $x^* = 0$ at which both constraints are active. Strict complementarity
does not hold at $x^*$ because the Lagrange multiplier associated with the active constraint
$x_1 \ge 0$ is zero.
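The multipliers in this small example can be checked directly. In the sketch below (variable names are illustrative), the constraint gradients for $x_1 \ge 0$ and $x_2 \ge 0$ are the unit vectors $e_1$ and $e_2$, so condition (16.37a) reduces to $\lambda^* = \nabla q(x^*)$.

```python
def gradient(x):
    """Gradient of q(x) = x1^2 + (x2 + 1)^2."""
    return [2.0 * x[0], 2.0 * (x[1] + 1.0)]

x_star = [0.0, 0.0]

# With a_1 = e_1 and a_2 = e_2, (16.37a) gives lambda_i = (grad q(x*))_i.
lam = gradient(x_star)

# Both bounds hold with equality at x*, so both constraints are active.
active = [i for i in range(2) if x_star[i] == 0.0]

# An active constraint with a zero multiplier is weakly active.
weakly_active = [i for i in active if lam[i] == 0.0]
```

Here `lam` is `[0.0, 2.0]`: the multiplier for $x_1 \ge 0$ vanishes even though the constraint is active, so index 0 (the first constraint) is weakly active, while the multiplier for $x_2 \ge 0$ is strictly positive.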

Degeneracy  can  cause  problems  for  algorithms  for  two  main  reasons.  First,  linear 
dependence  of  the  active  constraint  gradients  can  cause  numerical  difficulties  in  the  step 
computation  because  certain  matrices  that  we  need  to  factor  become  rank  deficient.  Second, 
when  the  problem  contains  weakly  active  constraints,  it  is  difficult  for  the  algorithm  to 
determine  whether  these  constraints  are  active  at  the  solution.  In  the  case  of  active-set 
methods  and  gradient  projection  methods  (described  below),  this  indecisiveness  can  cause 
the  algorithm  to  zigzag  as  the  iterates  move  on  and  off  the  weakly  active  constraints  on 
successive  iterations.  Safeguards  must  be  used  to  prevent  such  behavior. 


16.5 ACTIVE-SET METHODS FOR CONVEX QPs


We  now  describe  active-set  methods  for  solving  quadratic  programs  of  the  form  (16.1) 
containing  equality  and  inequality  constraints.  We  consider  only  the  convex  case,  in  which 
the  matrix  G  in  (16.1a)  is  positive  semidefinite.  The  case  in  which  G  is  an  indefinite  matrix 
raises  complications  in  the  algorithms  and  is  outside  the  scope  of  this  book.  We  refer  to 
Gould  [147]  for  a  discussion  of  nonconvex  QPs. 

If the contents of the optimal active set (16.36) were known in advance, we could
find the solution $x^*$ by applying one of the techniques for equality-constrained QP of
Sections 16.2 and 16.3 to the problem

$$ \min_x \; q(x) = \tfrac{1}{2} x^T G x + x^T c \quad \text{subject to } a_i^T x = b_i, \; i \in \mathcal{A}(x^*). $$

Of course, we usually do not have prior knowledge of $\mathcal{A}(x^*)$ and, as we now see,
determination of this set is the main challenge facing algorithms for inequality-constrained
QP.

We have already encountered an active-set approach for linear programming in
Chapter 13, namely, the simplex method. In essence, the simplex method starts by making a guess
of the optimal active set, then repeatedly uses gradient and Lagrange multiplier information
to drop one index from the current estimate of $\mathcal{A}(x^*)$ and add a new index, until optimality
is detected. Active-set methods for QP differ from the simplex method in that the iterates
(and the solution $x^*$) are not necessarily vertices of the feasible region.

Active-set  methods  for  QP  come  in  three  varieties:  primal,  dual,  and  primal-dual.  We 
restrict  our  discussion  to  primal  methods,  which  generate  iterates  that  remain  feasible  with 
respect  to  the  primal  problem  (16.1)  while  steadily  decreasing  the  objective  function  q(x). 

Primal active-set methods find a step from one iterate to the next by solving a quadratic
subproblem in which some of the inequality constraints (16.1c), and all the equality
constraints (16.1b), are imposed as equalities. This subset is referred to as the working set and
is denoted at the $k$th iterate $x_k$ by $\mathcal{W}_k$. An important requirement we impose on $\mathcal{W}_k$ is that
the gradients $a_i$ of the constraints in the working set be linearly independent, even when
the full set of active constraints at that point has linearly dependent gradients.

Given an iterate $x_k$ and the working set $\mathcal{W}_k$, we first check whether $x_k$ minimizes
the quadratic $q$ in the subspace defined by the working set. If not, we compute a step $p$
by solving an equality-constrained QP subproblem in which the constraints corresponding
to the working set $\mathcal{W}_k$ are regarded as equalities and all other constraints are temporarily
disregarded. To express this subproblem in terms of the step $p$, we define

$$ p = x - x_k, \qquad g_k = G x_k + c. $$

By substituting for $x$ into the objective function (16.1a), we find that

$$ q(x) = q(x_k + p) = \tfrac{1}{2} p^T G p + g_k^T p + \rho_k, $$

where $\rho_k = \tfrac{1}{2} x_k^T G x_k + c^T x_k$ is independent of $p$. Since we can drop $\rho_k$ from the objective
without changing the solution of the problem, we can write the QP subproblem to be solved
at the $k$th iteration as follows:

$$ \min_p \; \tfrac{1}{2} p^T G p + g_k^T p \tag{16.39a} $$
$$ \text{subject to } a_i^T p = 0, \quad i \in \mathcal{W}_k. \tag{16.39b} $$

We denote the solution of this subproblem by $p_k$. Note that for each $i \in \mathcal{W}_k$, the value of
$a_i^T x$ does not change as we move along $p_k$, since we have $a_i^T (x_k + \alpha p_k) = a_i^T x_k = b_i$ for
all $\alpha$. Since the constraints in $\mathcal{W}_k$ were satisfied at $x_k$, they are also satisfied at $x_k + \alpha p_k$, for
any value of $\alpha$. Since $G$ is positive definite, the solution of (16.39) can be computed by any
of the techniques described in Section 16.2.

Supposing for the moment that the optimal $p_k$ from (16.39) is nonzero, we need to
decide how far to move along this direction. If $x_k + p_k$ is feasible with respect to all the
constraints, we set $x_{k+1} = x_k + p_k$. Otherwise, we set

$$ x_{k+1} = x_k + \alpha_k p_k, \tag{16.40} $$

where the step-length parameter $\alpha_k$ is chosen to be the largest value in the range $[0, 1]$ for
which all constraints are satisfied. We can derive an explicit definition of $\alpha_k$ by considering
what happens to the constraints $i \notin \mathcal{W}_k$, since the constraints $i \in \mathcal{W}_k$ will certainly be
satisfied regardless of the choice of $\alpha_k$. If $a_i^T p_k \ge 0$ for some $i \notin \mathcal{W}_k$, then for all $\alpha_k \ge 0$ we
have $a_i^T (x_k + \alpha_k p_k) \ge a_i^T x_k \ge b_i$. Hence, constraint $i$ will be satisfied for all nonnegative
choices of the step-length parameter. Whenever $a_i^T p_k < 0$ for some $i \notin \mathcal{W}_k$, however, we
have that $a_i^T (x_k + \alpha_k p_k) \ge b_i$ only if

$$ \alpha_k \le \frac{b_i - a_i^T x_k}{a_i^T p_k}. $$

To maximize the decrease in $q$, we want $\alpha_k$ to be as large as possible in $[0, 1]$ subject to
retaining feasibility, so we obtain the following definition:

$$ \alpha_k \stackrel{\text{def}}{=} \min \left( 1, \; \min_{i \notin \mathcal{W}_k, \; a_i^T p_k < 0} \frac{b_i - a_i^T x_k}{a_i^T p_k} \right). \tag{16.41} $$

We call the constraints $i$ for which the minimum in (16.41) is achieved the blocking
constraints. (If $\alpha_k = 1$ and no new constraints are active at $x_k + \alpha_k p_k$, then there are no blocking
constraints on this iteration.) Note that it is quite possible for $\alpha_k$ to be zero, because we
could have $a_i^T p_k < 0$ for some constraint $i$ that is active at $x_k$ but not a member of the
current working set $\mathcal{W}_k$.
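The ratio test (16.41) translates almost line for line into code. The sketch below is illustrative: the function name `step_length`, the zero-based constraint indices, and the five-constraint test problem are assumptions made for this example. Constraints are stored as rows $a_i$ with right-hand sides $b_i$, meaning $a_i^T x \ge b_i$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def step_length(A, b, working, x, p):
    """Return (alpha, blocking) from the ratio test (16.41).

    Only constraints outside the working set with a_i^T p < 0 can limit the step.
    """
    alpha, blocking = 1.0, []
    for i, (a, bi) in enumerate(zip(A, b)):
        if i in working:
            continue
        ap = dot(a, p)
        if ap < 0.0:
            ratio = (bi - dot(a, x)) / ap
            if ratio < alpha:
                alpha, blocking = ratio, [i]      # new smallest ratio
            elif ratio == alpha:
                blocking.append(i)                # tie: also becomes active
    return alpha, blocking

# Illustrative constraints (assumed), each of the form a_i^T x >= b_i:
#   x1 - 2 x2 >= -2,  -x1 - 2 x2 >= -6,  -x1 + 2 x2 >= -2,  x1 >= 0,  x2 >= 0
A = [[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0], [1.0, 0.0], [0.0, 1.0]]
b = [-2.0, -6.0, -2.0, 0.0, 0.0]

alpha, blocking = step_length(A, b, working=set(), x=[1.0, 0.0], p=[0.0, 2.5])
```

From $x = (1, 0)$ along $p = (0, 2.5)$, the first constraint gives the ratio $(-2 - 1)/(-5) = 0.6$ and the second gives $1.0$, so the step is cut to $\alpha = 0.6$ and the first constraint (index 0) is the blocking constraint.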

If $\alpha_k < 1$, that is, the step along $p_k$ was blocked by some constraint not in $\mathcal{W}_k$, a new
working set $\mathcal{W}_{k+1}$ is constructed by adding one of the blocking constraints to $\mathcal{W}_k$.

We continue to iterate in this manner, adding constraints to the working set until we
reach a point $\hat{x}$ that minimizes the quadratic objective function over its current working set
$\hat{\mathcal{W}}$. It is easy to recognize such a point because the subproblem (16.39) has solution $p = 0$.
Since $p = 0$ satisfies the optimality conditions (16.5) for (16.39), we have that

$$ \sum_{i \in \hat{\mathcal{W}}} a_i \hat{\lambda}_i = g = G \hat{x} + c, \tag{16.42} $$

for some Lagrange multipliers $\hat{\lambda}_i$, $i \in \hat{\mathcal{W}}$. It follows that $\hat{x}$ and $\hat{\lambda}$ satisfy the first KKT
condition (16.37a), if we define the multipliers corresponding to the inequality constraints
that are not in the working set to be zero. Because of the control imposed on the step length,
$\hat{x}$ is also feasible with respect to all the constraints, so the second and third KKT conditions
(16.37b) and (16.37c) are satisfied at this point.

We now examine the signs of the multipliers corresponding to the inequality
constraints in the working set, that is, the indices $i \in \hat{\mathcal{W}} \cap \mathcal{I}$. If these multipliers are all
nonnegative, the fourth KKT condition (16.37d) is also satisfied, so we conclude that $\hat{x}$ is a
KKT point for the original problem (16.1). In fact, since $G$ is positive semidefinite, we have
from Theorem 16.4 that $\hat{x}$ is a global solution of (16.1). (As noted after Theorem 16.4, $\hat{x}$ is
a strict local minimizer and the unique global solution if $G$ is positive definite.)

If, on the other hand, one or more of the multipliers $\hat{\lambda}_j$, $j \in \hat{\mathcal{W}} \cap \mathcal{I}$, is negative,
the condition (16.37d) is not satisfied and the objective function $q(\cdot)$ may be decreased
by dropping one of these constraints, as shown in Section 12.3. Thus, we remove an index
$j$ corresponding to one of the negative multipliers from the working set and solve a new
subproblem (16.39) for the new step. We show in the following theorem that this strategy
produces a direction $p$ at the next iteration that is feasible with respect to the dropped
constraint. We continue to assume that the constraint gradients $a_i$ for $i$ in the working
set are linearly independent. After the algorithm has been fully stated, we discuss how this
property can be maintained.

Theorem 16.5.

Suppose that the point $\hat{x}$ satisfies first-order conditions for the equality-constrained
subproblem with working set $\hat{\mathcal{W}}$; that is, equation (16.42) is satisfied along with $a_i^T \hat{x} = b_i$ for
all $i \in \hat{\mathcal{W}}$. Suppose, too, that the constraint gradients $a_i$, $i \in \hat{\mathcal{W}}$, are linearly independent and
that there is an index $j \in \hat{\mathcal{W}}$ such that $\hat{\lambda}_j < 0$. Let $p$ be the solution obtained by dropping the
constraint $j$ and solving the following subproblem:

$$ \min_p \; \tfrac{1}{2} p^T G p + (G \hat{x} + c)^T p, \tag{16.43a} $$
$$ \text{subject to } a_i^T p = 0, \quad \text{for all } i \in \hat{\mathcal{W}} \text{ with } i \ne j. \tag{16.43b} $$

Then $p$ is a feasible direction for constraint $j$, that is, $a_j^T p \ge 0$. Moreover, if $p$ satisfies
second-order sufficient conditions for (16.43), then we have that $a_j^T p > 0$, and that $p$ is a descent
direction for $q(\cdot)$.

PROOF. Since $p$ solves (16.43), we have from the results of Section 16.1 that there are
multipliers $\tilde{\lambda}_i$, for all $i \in \hat{\mathcal{W}}$ with $i \ne j$, such that

$$ \sum_{i \in \hat{\mathcal{W}}, \, i \ne j} \tilde{\lambda}_i a_i = G p + (G \hat{x} + c). \tag{16.44} $$

In addition, we have by second-order necessary conditions that if $Z$ is a null-space basis
matrix for the matrix

$$ \left[ a_i^T \right]_{i \in \hat{\mathcal{W}}, \, i \ne j}, $$

then $Z^T G Z$ is positive semidefinite. Clearly, $p$ has the form $p = Z p_Z$ for some vector $p_Z$,
so it follows that $p^T G p \ge 0$.

We have made the assumption that $\hat{x}$ and $\hat{\mathcal{W}}$ satisfy the relation (16.42). By subtracting
(16.42) from (16.44), we obtain

$$ \sum_{i \in \hat{\mathcal{W}}, \, i \ne j} (\tilde{\lambda}_i - \hat{\lambda}_i) a_i - \hat{\lambda}_j a_j = G p. \tag{16.45} $$

By taking inner products of both sides with $p$ and using the fact that $a_i^T p = 0$ for all $i \in \hat{\mathcal{W}}$
with $i \ne j$, we have that

$$ -\hat{\lambda}_j a_j^T p = p^T G p. \tag{16.46} $$

Since $p^T G p \ge 0$ and $\hat{\lambda}_j < 0$ by assumption, it follows that $a_j^T p \ge 0$.

If the second-order sufficient conditions of Section 12.5 are satisfied, we have that
$Z^T G Z$ defined above is positive definite. From (16.46), we can have $a_j^T p = 0$ only if
$p^T G p = p_Z^T Z^T G Z p_Z = 0$, which happens only if $p_Z = 0$ and $p = 0$. But if $p = 0$, then
by substituting into (16.45) and using linear independence of $a_i$ for $i \in \hat{\mathcal{W}}$, we must have
that $\hat{\lambda}_j = 0$, which contradicts our choice of $j$. We conclude that $p^T G p > 0$ in (16.46), and
therefore $a_j^T p > 0$ whenever $p$ satisfies the second-order sufficient conditions for (16.43).

The claim that $p$ is a descent direction for $q(\cdot)$ is proved in Theorem 16.6 below. □

While any index $j$ for which $\hat{\lambda}_j < 0$ usually will yield a direction $p$ along which the
algorithm can make progress, the most negative multiplier is often chosen in practice (and
in the algorithm specified below). This choice is motivated by the sensitivity analysis given
in Chapter 12, which shows that the rate of decrease in the objective function when one
constraint is removed is proportional to the magnitude of the Lagrange multiplier for that
constraint. As in linear programming, however, the step along the resulting direction may
be short (as when it is blocked by a new constraint), so the amount of decrease in $q$ is not
guaranteed to be greater than for other possible choices of $j$.

We conclude with a result that shows that whenever $p_k$ obtained from (16.39) is
nonzero and satisfies second-order sufficient optimality conditions for the current working
set, it is a direction of strict descent for $q(\cdot)$.

Theorem 16.6.

Suppose that the solution $p_k$ of (16.39) is nonzero and satisfies the second-order sufficient
conditions for optimality for that problem. Then the function $q(\cdot)$ is strictly decreasing along
the direction $p_k$.

PROOF. Since $p_k$ satisfies the second-order conditions, that is, $Z^T G Z$ is positive definite
for the matrix $Z$ whose columns are a basis of the null space of the constraints (16.39b), we
have by applying Theorem 16.2 to (16.39) that $p_k$ is the unique global solution of (16.39).
Since $p = 0$ is also a feasible point for (16.39), its objective value in (16.39a) must be larger
than that of $p_k$, so we have

$$ \tfrac{1}{2} p_k^T G p_k + g_k^T p_k < 0. $$

Since $p_k^T G p_k \ge 0$ by convexity, this inequality implies that $g_k^T p_k < 0$. Therefore, we have

$$ q(x_k + \alpha p_k) = q(x_k) + \alpha g_k^T p_k + \tfrac{1}{2} \alpha^2 p_k^T G p_k < q(x_k), $$

for all $\alpha > 0$ sufficiently small. □

When $G$ is positive definite (the strictly convex case), the second-order sufficient
conditions are satisfied for all feasible subproblems of the form (16.39). Hence, it follows
from the result above that we obtain a strict decrease in $q(\cdot)$ whenever $p_k \ne 0$. This fact is
significant when we discuss finite termination of the algorithm.

SPECIFICATION  OF  THE  ACTIVE-SET  METHOD  FOR  CONVEX  QP 

Having  described  the  active-set  algorithm  for  convex  QP,  we  now  present  the  following 
formal  specification.  We  assume  that  the  objective  function  q  is  bounded  in  the  feasible  set 
(16.1b),  (16.1c). 

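Algorithm 16.3 (Active-Set Method for Convex QP).

Compute a feasible starting point $x_0$;
Set $\mathcal{W}_0$ to be a subset of the active constraints at $x_0$;
for $k = 0, 1, 2, \ldots$
    Solve (16.39) to find $p_k$;
    if $p_k = 0$
        Compute Lagrange multipliers $\hat{\lambda}_i$ that satisfy (16.42),
                with $\hat{\mathcal{W}} = \mathcal{W}_k$;
        if $\hat{\lambda}_i \ge 0$ for all $i \in \mathcal{W}_k \cap \mathcal{I}$
            stop with solution $x^* = x_k$;
        else
            $j \leftarrow \arg\min_{j \in \mathcal{W}_k \cap \mathcal{I}} \hat{\lambda}_j$;
            $x_{k+1} \leftarrow x_k$; $\mathcal{W}_{k+1} \leftarrow \mathcal{W}_k \setminus \{j\}$;
    else (* $p_k \ne 0$ *)
        Compute $\alpha_k$ from (16.41);
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$;
        if there are blocking constraints
            Obtain $\mathcal{W}_{k+1}$ by adding one of the blocking
                    constraints to $\mathcal{W}_k$;
        else
            $\mathcal{W}_{k+1} \leftarrow \mathcal{W}_k$;
end (for)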
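A compact, dependency-free sketch of Algorithm 16.3 follows. It is meant only to make the control flow concrete and rests on several assumptions: the function names are hypothetical, constraints use zero-based indices, the subproblem (16.39) is solved by dense Gaussian elimination on its KKT system, the working set is assumed to stay linearly independent, and the small test problem (a strictly convex objective with five linear inequalities) is chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [bi] for row, bi in zip(M, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def eqp_step(G, g, rows):
    """Solve (16.39) via its KKT system [G -A^T; A 0][p; lam] = [-g; 0]."""
    n, m = len(G), len(rows)
    K = [list(G[i]) + [-a[i] for a in rows] for i in range(n)]
    K += [list(a) + [0.0] * m for a in rows]
    sol = solve(K, [-gi for gi in g] + [0.0] * m)
    return sol[:n], sol[n:]

def active_set_qp(G, c, A, b, ineq, x, working, max_iter=100, eps=1e-10):
    """Sketch of Algorithm 16.3 for constraints a_i^T x >= b_i (i in ineq)
    and a_i^T x = b_i otherwise; x must be feasible on entry."""
    working = list(working)
    for _ in range(max_iter):
        g = [dot(row, x) + ci for row, ci in zip(G, c)]      # g_k = G x_k + c
        p, lam = eqp_step(G, g, [A[i] for i in working])
        if max(abs(pi) for pi in p) <= eps:                  # p_k = 0
            cand = [(l, i) for l, i in zip(lam, working) if i in ineq]
            if not cand or min(cand)[0] >= -eps:
                return x, dict(zip(working, lam))            # KKT point found
            working.remove(min(cand)[1])                     # most negative multiplier
        else:
            alpha, block = 1.0, None
            for i in range(len(A)):                          # ratio test (16.41)
                ap = dot(A[i], p)
                if i not in working and ap < -eps:
                    ratio = (b[i] - dot(A[i], x)) / ap
                    if ratio < alpha:
                        alpha, block = ratio, i
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            if block is not None:
                working.append(block)
    raise RuntimeError("iteration limit reached")

# Illustrative problem (assumed): min (x1 - 1)^2 + (x2 - 2.5)^2 subject to
#   x1 - 2 x2 >= -2,  -x1 - 2 x2 >= -6,  -x1 + 2 x2 >= -2,  x1 >= 0,  x2 >= 0
G = [[2.0, 0.0], [0.0, 2.0]]
c = [-2.0, -5.0]
A = [[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0], [1.0, 0.0], [0.0, 1.0]]
b = [-2.0, -6.0, -2.0, 0.0, 0.0]

x_opt, multipliers = active_set_qp(G, c, A, b, ineq={0, 1, 2, 3, 4},
                                   x=[2.0, 0.0], working=[2, 4])
```

Starting from the vertex $(2, 0)$ with both active constraints in the working set, the iteration drops the two constraints with negative multipliers in turn, picks up the first constraint as a blocking constraint, and stops at $x = (1.4, 1.7)$ with the single multiplier $0.8$ on that constraint.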
Various techniques can be used to determine an initial feasible point. One such is
to use the "Phase I" approach for linear programming described in Chapter 13. Though
no significant modifications are needed to generalize this method from linear programming
to quadratic programming, we describe a variant here that allows the user to
supply an initial estimate $\tilde{x}$ of the vector $x$. This estimate need not be feasible, but a
good choice based on knowledge of the QP may reduce the work needed in the Phase I
step.

Given $\tilde{x}$, we define the following feasibility linear program:

$$ \min_{(x, z)} \; e^T z $$
$$ \text{subject to } a_i^T x + \gamma_i z_i = b_i, \quad i \in \mathcal{E}, $$
$$ a_i^T x + \gamma_i z_i \ge b_i, \quad i \in \mathcal{I}, $$
$$ z \ge 0, $$

where $e = (1, 1, \ldots, 1)^T$, $\gamma_i = -\mathrm{sign}(a_i^T \tilde{x} - b_i)$ for $i \in \mathcal{E}$, and $\gamma_i = 1$ for $i \in \mathcal{I}$. A feasible
initial point for this problem is then

$$ x = \tilde{x}, \qquad z_i = |a_i^T \tilde{x} - b_i| \; (i \in \mathcal{E}), \qquad z_i = \max(b_i - a_i^T \tilde{x}, 0) \; (i \in \mathcal{I}). $$

It is easy to verify that if $\tilde{x}$ is feasible for the original problem (16.1), then $(\tilde{x}, 0)$ is optimal for
the feasibility subproblem. In general, if the original problem has feasible points, then the
optimal objective value in the subproblem is zero, and any solution of the subproblem yields
a feasible point for the original problem. The initial working set $\mathcal{W}_0$ for Algorithm 16.3 can
be found by taking a linearly independent subset of the active constraints at the solution of
the feasibility problem.
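The construction of the starting point for the feasibility linear program can be sketched as follows. The function name `phase1_start`, the zero-based index set `eq` for the equality constraints, and the data are assumptions for this illustration. The code only builds $\gamma$ and $z$ and checks that the constraints of the feasibility problem hold; it does not solve the linear program.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(t):
    return (t > 0) - (t < 0)

def phase1_start(A, b, eq, x_tilde):
    """Return (gamma, z) so that (x_tilde, z) is feasible for the Phase I LP."""
    gamma, z = [], []
    for i, (a, bi) in enumerate(zip(A, b)):
        res = dot(a, x_tilde) - bi            # a_i^T x~ - b_i
        if i in eq:
            gamma.append(-sign(res))
            z.append(abs(res))
        else:
            gamma.append(1)
            z.append(max(-res, 0.0))          # max(b_i - a_i^T x~, 0)
    return gamma, z

# Illustrative data (assumed): x1 + x2 = 1, x1 >= 0, x2 >= 0, infeasible guess x~
A = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0, 0.0]
eq = {0}
x_tilde = [2.0, -0.5]

gamma, z = phase1_start(A, b, eq, x_tilde)

# Left-hand sides a_i^T x~ + gamma_i z_i of the Phase I constraints
lhs = [dot(a, x_tilde) + gi * zi for a, gi, zi in zip(A, gamma, z)]
```

For this guess the equality residual is $0.5$ and the bound $x_2 \ge 0$ is violated by $0.5$, so $z = (0.5, 0, 0.5)$ and the initial Phase I objective $e^T z$ equals $1$.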

An alternative approach is a penalty (or "big M") method, which does away with the
"Phase I" and instead includes a measure of infeasibility in the objective that is guaranteed
to be zero at the solution. That is, we introduce a scalar artificial variable $\eta$ into (16.1) to
measure the constraint violation, and we solve the problem

$$ \min_{(x, \eta)} \; \tfrac{1}{2} x^T G x + x^T c + M \eta, $$
$$ \text{subject to } (a_i^T x - b_i) \le \eta, \quad i \in \mathcal{E}, $$
$$ -(a_i^T x - b_i) \le \eta, \quad i \in \mathcal{E}, \tag{16.47} $$
$$ b_i - a_i^T x \le \eta, \quad i \in \mathcal{I}, $$
$$ 0 \le \eta, $$

for some large positive value of $M$. It can be shown by applying the theory of exact penalty
functions (see Chapter 17) that whenever there exist feasible points for the original problem
(16.1), then for all $M$ sufficiently large, the solution of (16.47) will have $\eta = 0$, with an $x$
component that is a solution for (16.1).
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Our  strategy  is  to  use  some  heuristic  to  choose  a  value  of  M  and  solve  (16.47)  by 
the  usual  means.  If  the  solution  we  obtain  has  a  positive  value  of  rj,  we  increase  M  and  try 
again.  Note  that  a  feasible  point  is  easy  to  obtain  for  the  subproblem  (16.47):  We  set  x  —  x 
(where,  as  before,  x  is  the  user-supplied  initial  guess)  and  choose  r]  large  enough  that  all  the 
constraints  in  ( 16.47)  are  satisfied.  This  approach  is,  in  fact,  an  exact  penalty  method  using 
the  loo  norm;  see  Chapter  17. 
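As a small illustration of the last point, the smallest $\eta$ that makes $(\tilde{x}, \eta)$ feasible for (16.47) is simply the largest constraint violation at $\tilde{x}$. The sketch below assumes a list-of-rows data layout, and the name `initial_eta` is hypothetical.

```python
def initial_eta(A_eq, b_eq, A_in, b_in, x_tilde):
    """Smallest eta >= 0 for which (x_tilde, eta) satisfies the constraints of (16.47)."""
    dot = lambda a, x: sum(ai * xi for ai, xi in zip(a, x))
    violations = [0.0]                                                       # 0 <= eta
    violations += [abs(dot(a, x_tilde) - b) for a, b in zip(A_eq, b_eq)]     # i in E
    violations += [b - dot(a, x_tilde) for a, b in zip(A_in, b_in)]          # i in I
    return max(violations)


# Made-up data: x1 + x2 = 1, x1 >= 0, x2 >= 0, infeasible guess (3, -1).
eta0 = initial_eta([[1, 1]], [1], [[1, 0], [0, 1]], [0, 0], [3, -1])
```

Any larger value of $\eta$ is also feasible; choosing the smallest one just gives the starting point with the lowest penalty term.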

A variant of (16.47) that penalizes the $\ell_1$ norm of the constraint violation rather than
the $\ell_\infty$ norm is as follows:

$$
\begin{aligned}
\min_{(x,s,t,v)} \quad & \tfrac{1}{2} x^T G x + x^T c + M e_{\mathcal{E}}^T (s + t) + M e_{\mathcal{I}}^T v \\
\text{subject to} \quad & a_i^T x - b_i + s_i - t_i = 0, \quad i \in \mathcal{E}, \\
& a_i^T x - b_i + v_i \ge 0, \quad i \in \mathcal{I}, \\
& s \ge 0, \quad t \ge 0, \quad v \ge 0.
\end{aligned}
\tag{16.48}
$$

Here, $e_{\mathcal{E}}$ is the vector $(1, 1, \dots, 1)^T$ of length $|\mathcal{E}|$; similarly for $e_{\mathcal{I}}$. The slack variables $s_i$, $t_i$,
and $v_i$ soak up any infeasibility in the constraints.

In the following example we use subscripts on the vectors $x$ and $p$ to denote their
components, and we use superscripts to indicate the iteration index. For example, $x_1$ denotes
the first component, while $x^4$ denotes the fourth iterate of the vector $x$.



Figure  16.3  Iterates  of  the  active-set  method. 
□ Example 16.4

We apply Algorithm 16.3 to the following simple 2-dimensional problem illustrated
in Figure 16.3.

$$
\begin{aligned}
\min_x \quad & q(x) = (x_1 - 1)^2 + (x_2 - 2.5)^2 && \text{(16.49a)} \\
\text{subject to} \quad & x_1 - 2x_2 + 2 \ge 0, && \text{(16.49b)} \\
& -x_1 - 2x_2 + 6 \ge 0, && \text{(16.49c)} \\
& -x_1 + 2x_2 + 2 \ge 0, && \text{(16.49d)} \\
& x_1 \ge 0, && \text{(16.49e)} \\
& x_2 \ge 0. && \text{(16.49f)}
\end{aligned}
$$

We refer to the constraints, in order, by indices 1 through 5. For this problem it is easy to
determine a feasible initial point; say $x^0 = (2, 0)^T$. Constraints 3 and 5 are active at this
point, and we set $\mathcal{W}_0 = \{3, 5\}$. (Note that we could just as validly have chosen $\mathcal{W}_0 = \{5\}$
or $\mathcal{W}_0 = \{3\}$ or even $\mathcal{W}_0 = \emptyset$; each choice would lead the algorithm to perform somewhat
differently.)

Since $x^0$ lies on a vertex of the feasible region, it is obviously a minimizer of the
objective function $q$ with respect to the working set $\mathcal{W}_0$; that is, the solution of (16.39) with
$k = 0$ is $p = 0$. We can then use (16.42) to find the multipliers $\lambda_3$ and $\lambda_5$ associated with
the active constraints. Substitution of the data from our problem into (16.42) yields

$$
\begin{bmatrix} -1 \\ 2 \end{bmatrix} \lambda_3 + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \lambda_5 = \begin{bmatrix} 2 \\ -5 \end{bmatrix},
$$

which has the solution $(\lambda_3, \lambda_5) = (-2, -1)$.

We now remove constraint 3 from the working set, because it has the most negative
multiplier, and set $\mathcal{W}_1 = \{5\}$. We begin iteration 1 by finding the solution of (16.39) for
$k = 1$, which is $p^1 = (-1, 0)^T$. The step-length formula (16.41) yields $\alpha_1 = 1$, and the new
iterate is $x^2 = (1, 0)^T$.

There are no blocking constraints, so that $\mathcal{W}_2 = \mathcal{W}_1 = \{5\}$, and we find at the start of
iteration 2 that the solution of (16.39) is $p^2 = 0$. From (16.42) we deduce that the Lagrange
multiplier for the lone working constraint is $\lambda_5 = -5$, so we drop 5 from the working set to
obtain $\mathcal{W}_3 = \emptyset$.

Iteration 3 starts by solving the unconstrained problem, to obtain the solution $p^3 =
(0, 2.5)^T$. The formula (16.41) yields a step length of $\alpha_3 = 0.6$ and a new iterate $x^4 =
(1, 1.5)^T$. There is a single blocking constraint (constraint 1), so we obtain $\mathcal{W}_4 = \{1\}$. The
solution of (16.39) for $k = 4$ is then $p^4 = (0.4, 0.2)^T$, and the new step length is 1. There
are no blocking constraints on this step, so the next working set is unchanged: $\mathcal{W}_5 = \{1\}$.
The new iterate is $x^5 = (1.4, 1.7)^T$.

Finally, we solve (16.39) for $k = 5$ to obtain a solution $p^5 = 0$. The formula (16.42)
yields a multiplier $\lambda_1 = 0.8$, so we have found the solution. We set $x^* = (1.4, 1.7)^T$ and
terminate.

□ 
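The step length $\alpha_3 = 0.6$ in iteration 3 can be reproduced with a short ratio test. The sketch below assumes that (16.41) is the usual rule $\alpha = \min\bigl(1, \min\{(b_i - a_i^T x)/(a_i^T p) : i \notin \mathcal{W},\ a_i^T p < 0\}\bigr)$ for constraints written as $a_i^T x \ge b_i$; the function name `step_length` is hypothetical.

```python
def step_length(x, p, A, b, working_set):
    """Ratio test for constraints a_i^T x >= b_i (indices start at 1).

    Returns (alpha, blocking), where blocking is the index of the constraint
    that limits the step, or None if the full step alpha = 1 is taken.
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    alpha, blocking = 1.0, None
    for i, (a, bi) in enumerate(zip(A, b), start=1):
        ap = dot(a, p)
        if i in working_set or ap >= 0.0:
            continue                      # constraint cannot block this step
        ratio = (bi - dot(a, x)) / ap
        if ratio < alpha:
            alpha, blocking = ratio, i
    return alpha, blocking


# Constraints (16.49b)-(16.49f) rewritten as a_i^T x >= b_i.
A = [[1, -2], [-1, -2], [-1, 2], [1, 0], [0, 1]]
b = [-2, -6, -2, 0, 0]

# Iteration 3 of Example 16.4: x = (1, 0), p = (0, 2.5), empty working set.
alpha3, blocking3 = step_length([1.0, 0.0], [0.0, 2.5], A, b, working_set=set())
```

With this data the test stops at constraint 1, in agreement with the blocking constraint identified in the example.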


FURTHER  REMARKS  ON  THE  ACTIVE-SET  METHOD 

We noted above that there is flexibility in the choice of the initial working set and that
each initial choice leads to a different iteration sequence. When the initial active constraints
have independent gradients, as above, we can include them all in $\mathcal{W}_0$. Alternatively, we can
select a subset. For instance, if in the example above we had chosen $\mathcal{W}_0 = \{3\}$, the first
iteration would have yielded $p^0 = (0.2, 0.1)^T$ and a new iterate of $x^1 = (2.2, 0.1)^T$. If we
had chosen $\mathcal{W}_0 = \{5\}$, we would have moved immediately to the new iterate $x^1 = (1, 0)^T$,
without first performing the operation of dropping the index 3, as is done in the example. If
we had selected $\mathcal{W}_0 = \emptyset$, we would have obtained $p^1 = (-1, 2.5)^T$, $\alpha_1 = \tfrac{2}{3}$, a new iterate of
$x^1 = (\tfrac{4}{3}, \tfrac{5}{3})^T$, and a new working set of $\mathcal{W}_1 = \{1\}$. The solution $x^*$ would have been found
on the next iteration.

Even if the initial working set $\mathcal{W}_0$ coincides with the initial active set, the sets $\mathcal{W}_k$
and $\mathcal{A}(x^k)$ may differ at later iterations. For instance, when a particular step encounters
more than one blocking constraint, just one of them is added to the working set, so the
identification between $\mathcal{W}_k$ and $\mathcal{A}(x^k)$ is broken. Moreover, subsequent iterates differ in
general according to what choice is made.

We require the constraint gradients in $\mathcal{W}_0$ to be linearly independent, and our strategy
for modifying the working set ensures that this same property holds for all subsequent
working sets $\mathcal{W}_k$. When we encounter a blocking constraint on a particular step, its constraint
normal cannot be a linear combination of the normals $a_i$ in the current working set (see
Exercise 16.18). Hence, linear independence is maintained after the blocking constraint is
added to the working set. On the other hand, deletion of an index from the working set
cannot introduce linear dependence.

The strategy of removing the constraint corresponding to the most negative Lagrange
multiplier often works well in practice but has the disadvantage that it is susceptible to the
scaling of the constraints. (By multiplying constraint $i$ by some factor $\beta > 0$ we do not
change the geometry of the optimization problem, but we introduce a scaling of $1/\beta$ to
the corresponding multiplier $\lambda_i$.) Choice of the most negative multiplier is analogous to
Dantzig’s original pivot rule for the simplex method in linear programming (see Chapter 13)
and, as we noted there, strategies that are less sensitive to scaling often give better results.
We do not discuss this advanced topic further.
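The effect of scaling on the multipliers can be seen directly from the stationarity condition. The display below is a short worked illustration, assuming that the multipliers are defined through a relation of the form $\sum_{i \in \mathcal{W}} \lambda_i a_i = g$ with $g = Gx + c$ (the form used in Example 16.4): replacing constraint $j$ by $\beta a_j^T x \ge \beta b_j$ leaves the feasible set unchanged, but the same gradient $g$ is then matched with the multiplier $\lambda_j / \beta$.

```latex
\sum_{i \in \mathcal{W}} \lambda_i a_i = g
\quad \Longrightarrow \quad
\sum_{i \in \mathcal{W},\, i \ne j} \lambda_i a_i
  + \frac{\lambda_j}{\beta}\,(\beta a_j) = g .
```

A constraint with a small negative multiplier can therefore be made to look like the "most negative" one simply by choosing $\beta$ small, which is why the rule depends on how the constraints happen to be scaled.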

We note that the strategy of adding or deleting at most one constraint at each iteration
of Algorithm 16.3 places a natural lower bound on the number of iterations needed
to reach optimality. Suppose, for instance, that we have a problem in which $m$ inequality
constraints are active at the solution $x^*$ but that we start from a point $x^0$ that is strictly
feasible with respect to all the inequality constraints. In this case, the algorithm will need
at least $m$ iterations to move from $x^0$ to $x^*$. Even more iterations will be required if the
algorithm adds some constraint $j$ to the working set at some iteration, only to remove it at
a later step.

FINITE  TERMINATION  OF  ACTIVE-SET  ALGORITHM  ON  STRICTLY  CONVEX 
QPs 

It is not difficult to show that, under certain assumptions, Algorithm 16.3 converges
for strictly convex QPs, that is, it identifies the solution $x^*$ in a finite number of iterations.
This claim is certainly true if we assume that the method always takes a nonzero step length
$\alpha_k$ whenever the direction $p_k$ computed from (16.39) is nonzero. Our argument proceeds
as follows:

•  If the solution of (16.39) is $p_k = 0$, the current point $x_k$ is the unique global minimizer
of $q(\cdot)$ for the working set $\mathcal{W}_k$; see Theorem 16.6. If it is not the solution of the
original problem (16.1) (that is, at least one of the Lagrange multipliers is negative),
Theorems 16.5 and 16.6 together show that the step $p_{k+1}$ computed after a constraint
is dropped will be a strict decrease direction for $q(\cdot)$. Therefore, because of our
assumption $\alpha_k > 0$, we have that the value of $q$ is lower than $q(x_k)$ at all subsequent
iterations. It follows that the algorithm can never return to the working set $\mathcal{W}_k$,
because subsequent iterates have values of $q$ that are lower than the global minimizer
for this working set.

•  The algorithm encounters an iterate $k$ for which $p_k = 0$ solves (16.39) at least on
every $n$th iteration. To demonstrate this claim, we note that for any $k$ at which $p_k \ne 0$,
either we have $\alpha_k = 1$ (in which case we reach the minimizer of $q$ on the current
working set $\mathcal{W}_k$, so that the next iteration will yield $p_{k+1} = 0$), or else a constraint
is added to the working set $\mathcal{W}_k$. If the latter situation occurs repeatedly, then after
at most $n$ iterations the working set will contain $n$ indices, which correspond to $n$
linearly independent vectors. The solution of (16.39) will then be $p_k = 0$, since only
the zero vector will satisfy the constraints (16.39b).

•  Taken together, the two statements above indicate that the algorithm finds the global
minimum of $q$ on its current working set periodically (at least once every $n$ iterations)
and that, having done so, it never visits this particular working set again. It follows
that, since there are only a finite number of possible working sets, the algorithm
cannot iterate forever. Eventually, it encounters a minimizer for a current working set
that satisfies optimality conditions for (16.1), and it terminates with a solution.

The assumption that we can always take a nonzero step along a nonzero descent
direction $p_k$ calculated from (16.39) guarantees that the algorithm does not undergo cycling.
This term refers to the situation in which a sequence of consecutive iterations results in no
movement in iterate $x$, while the working set $\mathcal{W}_k$ undergoes deletions and additions of indices
and eventually repeats itself. That is, for some integers $k$ and $l \ge 1$, we have that $x^k = x^{k+l}$
and $\mathcal{W}_k = \mathcal{W}_{k+l}$. At each iterate in the cycle, a constraint is dropped (as in Theorem 16.5),
but a new constraint $i \notin \mathcal{W}_k$ is encountered immediately without any movement along
the computed direction $p$. Procedures for handling degeneracy and cycling in quadratic
programming are similar to those for linear programming discussed in Chapter 13; we do
not discuss them here. Most QP implementations simply ignore the possibility of cycling.

UPDATING  FACTORIZATIONS 

We have seen that the step computation in the active-set method given in
Algorithm 16.3 requires the solution of the equality-constrained subproblem (16.39). As
mentioned  at  the  beginning  of  this  chapter,  this  computation  amounts  to  solving  the  KKT 
system  (16.5).  Since  the  working  set  can  change  by  just  one  index  at  every  iteration,  the  KKT 
matrix  differs  in  at  most  one  row  and  one  column  from  the  previous  iteration’s  KKT  matrix. 
Indeed,  G  remains  fixed,  whereas  the  matrix  A  of  constraint  gradients  corresponding  to  the 
current  working  set  may  change  through  addition  and/or  deletion  of  a  single  row. 

It  follows  from  this  observation  that  we  can  compute  the  matrix  factors  needed  to  solve 
( 16.39)  at  the  current  iteration  by  updating  the  factors  computed  at  the  previous  iteration, 
rather  than  recomputing  them  from  scratch.  These  updating  techniques  are  crucial  to  the 
efficiency  of  active-set  methods. 

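We limit our discussion to the case in which the step is computed with the null-space
method (16.17)-(16.20). Suppose that $A$ has $m$ linearly independent rows and assume that
the bases $Y$ and $Z$ are defined by means of a QR factorization of $A$ (see Section 15.3 for
details). Thus

$$
A^T \Pi = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix}
\tag{16.50}
$$

(see (15.21)), where $\Pi$ is a permutation matrix; $R$ is square, upper triangular and nonsingular;
$Q = [\, Q_1 \ \ Q_2 \,]$ is $n \times n$ orthogonal; and $Q_1$ and $R$ both have $m$ columns while $Q_2$
has $n - m$ columns. As noted in Chapter 15, we can choose $Z$ to be simply the orthonormal
matrix $Q_2$.

Suppose that one constraint is added to the working set at the next iteration, so that
the new constraint matrix is $\bar{A}^T = [\, A^T \ \ a \,]$, where $a$ is a column vector of length $n$
such that $\bar{A}^T$ retains full column rank. As we now show, there is an economical way to
update the $Q$ and $R$ factors in (16.50) to obtain new factors (and hence a new null-space
basis matrix $\bar{Z}$, with $n - m - 1$ columns) for the expanded matrix $\bar{A}$. Note first that, since
$Q_1 Q_1^T + Q_2 Q_2^T = I$, we have

$$
\bar{A}^T \begin{bmatrix} \Pi & 0 \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} A^T \Pi & a \end{bmatrix}
= Q \begin{bmatrix} R & Q_1^T a \\ 0 & Q_2^T a \end{bmatrix}.
\tag{16.51}
$$

We can now define an orthogonal matrix $\hat{Q}$ that transforms the vector $Q_2^T a$ to a vector in
which all elements except the first are zero. That is, we have

$$
\hat{Q}\,(Q_2^T a) = \begin{bmatrix} \gamma \\ 0 \end{bmatrix},
$$

where $\gamma$ is a scalar. (Since $\hat{Q}$ is orthogonal, we have $\|Q_2^T a\| = |\gamma|$.) From (16.51) we now
have

$$
\bar{A}^T \begin{bmatrix} \Pi & 0 \\ 0 & 1 \end{bmatrix}
= Q \begin{bmatrix} R & Q_1^T a \\ 0 & \hat{Q}^T \begin{bmatrix} \gamma \\ 0 \end{bmatrix} \end{bmatrix}
= Q \begin{bmatrix} I & 0 \\ 0 & \hat{Q}^T \end{bmatrix}
\begin{bmatrix} R & Q_1^T a \\ 0 & \gamma \\ 0 & 0 \end{bmatrix}.
$$

This factorization has the form

$$
\bar{A}^T \bar{\Pi} = \bar{Q} \begin{bmatrix} \bar{R} \\ 0 \end{bmatrix},
\qquad
\bar{\Pi} = \begin{bmatrix} \Pi & 0 \\ 0 & 1 \end{bmatrix}, \quad
\bar{Q} = Q \begin{bmatrix} I & 0 \\ 0 & \hat{Q}^T \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \hat{Q}^T \end{bmatrix}, \quad
\bar{R} = \begin{bmatrix} R & Q_1^T a \\ 0 & \gamma \end{bmatrix}.
$$

We can therefore choose $\bar{Z}$ to be the last $n - m - 1$ columns of $Q_2 \hat{Q}^T$. If we know $Z$ explicitly
and need an explicit representation of $\bar{Z}$, we need to account for the cost of obtaining $\hat{Q}$ and
the cost of forming the product $Q_2 \hat{Q}^T = Z \hat{Q}^T$. Because of the special structure of $\hat{Q}$, this
cost is of order $n(n - m)$, compared to the cost of computing (16.50) from scratch, which is
of order $n^2 m$. The updating strategy is less expensive, especially when the null space is small
(that is, when $n - m \ll n$).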
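One common way to build an orthogonal matrix that maps a vector onto a multiple of the first unit vector is a Householder reflection. The sketch below is an illustration under that assumption (the text does not say which construction is used for the orthogonal factor); the function name `householder_to_e1` is hypothetical, and `v` plays the role of $Q_2^T a$.

```python
import math


def householder_to_e1(v):
    """Return (H, gamma) with H orthogonal and symmetric, and H v = gamma * e_1."""
    k = len(v)
    norm = math.sqrt(sum(vi * vi for vi in v))
    gamma = -norm if v[0] >= 0 else norm       # sign chosen to avoid cancellation
    u = list(v)
    u[0] -= gamma                              # u = v - gamma * e_1
    uu = sum(ui * ui for ui in u)
    H = [[float(i == j) for j in range(k)] for i in range(k)]
    if uu > 0.0:                               # H = I - 2 u u^T / (u^T u)
        for i in range(k):
            for j in range(k):
                H[i][j] -= 2.0 * u[i] * u[j] / uu
    return H, gamma


# Made-up vector standing in for Q_2^T a.
v = [3.0, 4.0]
H, gamma = householder_to_e1(v)
Hv = [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
```

Forming `H` explicitly is done here only for clarity. In practice the reflector is applied through the vector `u`, which is what keeps the cost of updating the null-space basis at order $n(n - m)$.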

An updating technique can also be designed for the case in which a row is removed
from $A$. This operation has the effect of deleting a column from $R$ in (16.50), thus disturbing
the upper triangular property of this matrix by introducing a number of nonzeros on the
diagonal immediately below the main diagonal of the matrix. Upper triangularity can be
restored by applying a sequence of plane rotations. These rotations introduce a number
of inexpensive transformations into the first $m$ columns of $Q$, and the updated null-space
matrix is obtained by selecting the last $n - m + 1$ columns from this matrix after the
transformations are complete. The new null-space basis in this case has the form

$$
\bar{Z} = \begin{bmatrix} \bar{z} & Z \end{bmatrix},
\tag{16.52}
$$

that is, the current matrix $Z$ is augmented by a single column. The total cost of this
operation varies with the location of the removed column in $A$ but is in general cheaper
than recomputing a QR factorization from scratch. For details of these procedures, see Gill
et al. [124, Section 5].

We now consider the reduced Hessian. Because of the special form of (16.39), we have
$h = 0$ in (16.5), and the step $p_Y$ given in (16.18) is zero. Thus from (16.19), the null-space
component $p_Z$ is the solution of

$$
(Z^T G Z)\, p_Z = -Z^T g.
\tag{16.53}
$$

We can sometimes find ways of updating the factorization of the reduced Hessian $Z^T G Z$
after $Z$ has changed. Suppose that we have the Cholesky factorization of the current reduced
Hessian, written as

$$
Z^T G Z = L L^T,
$$

and that at the next step $Z$ changes as in (16.52), gaining a column after deletion of a
constraint. A series of inexpensive, elementary operations can be used to transform the
Cholesky factor $L$ into the new factor $\bar{L}$ for the new reduced Hessian $\bar{Z}^T G \bar{Z}$.

A variety of other simplifications are possible. For example, as discussed in
Section 16.7, we can update the reduced gradient $Z^T g$ at the same time as we update $Z$ to
$\bar{Z}$.


16.6  INTERIOR-POINT  METHODS 


The  interior-point  approach  can  be  applied  to  convex  quadratic  programs  through  a  simple 
extension  of  the  linear-programming  algorithms  described  in  Chapter  14.  The  resulting 
primal-dual  algorithms  are  easy  to  describe  and  are  quite  efficient  on  many  types  of  problems. 
Extensions  of  interior-point  methods  to  nonconvex  problems  are  discussed  in  Chapter  19. 

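For simplicity, we restrict our attention to convex quadratic programs with inequality
constraints, which we write as follows:

$$
\begin{aligned}
\min_x \quad & q(x) = \tfrac{1}{2} x^T G x + x^T c && \text{(16.54a)} \\
\text{subject to} \quad & Ax \ge b, && \text{(16.54b)}
\end{aligned}
$$

where $G$ is symmetric and positive semidefinite and where the $m \times n$ matrix $A$ and right-hand
side $b$ are defined by

$$
A = [a_i^T]_{i \in \mathcal{I}}, \qquad b = [b_i]_{i \in \mathcal{I}}, \qquad \mathcal{I} = \{1, 2, \dots, m\}.
$$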

(If equality constraints are also present, they can be accommodated with simple extensions
to the approaches described below.) Rewriting the KKT conditions (16.37) in this notation,
we obtain

$$
\begin{aligned}
Gx - A^T \lambda + c &= 0, \\
Ax - b &\ge 0, \\
(Ax - b)_i \lambda_i &= 0, \quad i = 1, 2, \dots, m, \\
\lambda &\ge 0.
\end{aligned}
$$

By introducing the slack vector $y \ge 0$, we can rewrite these conditions as

$$
\begin{aligned}
Gx - A^T \lambda + c &= 0, && \text{(16.55a)} \\
Ax - y - b &= 0, && \text{(16.55b)} \\
y_i \lambda_i &= 0, \quad i = 1, 2, \dots, m, && \text{(16.55c)} \\
(y, \lambda) &\ge 0. && \text{(16.55d)}
\end{aligned}
$$

Since we assume that $G$ is positive semidefinite, these KKT conditions are not only necessary
but also sufficient (see Theorem 16.4), so we can solve the convex quadratic program (16.54)
by finding solutions of the system (16.55).

Given a current iterate $(x, y, \lambda)$ that satisfies $(y, \lambda) > 0$, we can define a
complementarity measure $\mu$ by

$$
\mu = \frac{y^T \lambda}{m}.
\tag{16.56}
$$


As in Chapter 14, we derive path-following, primal-dual methods by considering the
perturbed KKT conditions given by

$$
F(x, y, \lambda; \sigma\mu) =
\begin{bmatrix}
Gx - A^T \lambda + c \\
Ax - y - b \\
Y \Lambda e - \sigma\mu e
\end{bmatrix} = 0,
\tag{16.57}
$$

where

$$
Y = \mathrm{diag}(y_1, y_2, \dots, y_m), \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \qquad e = (1, 1, \dots, 1)^T,
$$

and $\sigma \in [0, 1]$. The solutions of (16.57) for all positive values of $\sigma$ and $\mu$ define the central
path, which is a trajectory that leads to the solution of the quadratic program as $\sigma\mu$ tends
to zero.

By fixing $\mu$ and applying Newton’s method to (16.57), we obtain the linear system

$$
\begin{bmatrix}
G & 0 & -A^T \\
A & -I & 0 \\
0 & \Lambda & Y
\end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} -r_d \\ -r_p \\ -\Lambda Y e + \sigma\mu e \end{bmatrix},
\tag{16.58}
$$

where

$$
r_d = Gx - A^T \lambda + c, \qquad r_p = Ax - y - b.
\tag{16.59}
$$

We obtain the next iterate by setting

$$
(x^+, y^+, \lambda^+) = (x, y, \lambda) + \alpha(\Delta x, \Delta y, \Delta\lambda),
\tag{16.60}
$$

where $\alpha$ is chosen to retain the inequality $(y^+, \lambda^+) > 0$ and possibly to satisfy various other
conditions.

In  the  rest  of  the  chapter  we  discuss  several  enhancements  of  this  primal-dual  iteration 
that  make  it  effective  in  practice. 

SOLVING  THE  PRIMAL-DUAL  SYSTEM 

The  major  computational  operation  in  the  interior-point  method  is  the  solution  of 
the  system  ( 16.58).  The  coefficient  matrix  in  this  system  can  be  much  more  costly  to  factor 
than  the  matrix  (14.9)  arising  in  linear  programming  because  of  the  presence  of  the  Hessian 
matrix  G.  It  is  therefore  important  to  exploit  the  structure  of  (16.58)  by  choosing  a  suitable 
direct  factorization  algorithm,  or  by  choosing  an  appropriate  preconditioner  for  an  iterative 
solver. 

As in Chapter 14, the system (16.58) may be restated in more compact forms. The
“augmented system” form is

$$
\begin{bmatrix} G & -A^T \\ A & \Lambda^{-1} Y \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} -r_d \\ -r_p + (-y + \sigma\mu \Lambda^{-1} e) \end{bmatrix}.
\tag{16.61}
$$

After a simple transformation to symmetric form, a symmetric indefinite factorization
scheme can be applied to the coefficient matrix in this system. The “normal equations” form
(14.44a) is

$$
(G + A^T Y^{-1} \Lambda A)\, \Delta x = -r_d + A^T Y^{-1} \Lambda\, [-r_p - y + \sigma\mu \Lambda^{-1} e],
\tag{16.62}
$$

which can be solved by means of a modified Cholesky algorithm. This approach is effective
if the term $A^T (Y^{-1} \Lambda) A$ is not too dense compared with $G$, and it has the advantage of
being much smaller than (16.61) if there are many inequality constraints.

The projected CG method of Algorithm 16.2 can also be effective for solving the
primal-dual system. We can rewrite (16.58) in the form

$$
\begin{bmatrix}
G & 0 & -A^T \\
0 & Y^{-1} \Lambda & I \\
A & -I & 0
\end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} -r_d \\ -\Lambda e + \sigma\mu Y^{-1} e \\ -r_p \end{bmatrix},
\tag{16.63}
$$

and observe that these are the optimality conditions for an equality-constrained convex
quadratic program of the form (16.3), in which the variable is $(\Delta x, \Delta y)$. Hence, we can
make appropriate substitutions and solve this system using Algorithm 16.2. This approach
may be useful for problems in which the direct factorization cannot be performed due to
excessive memory demands. The projected CG method does not require that the matrix $G$
be formed or factored; it requires only matrix-vector products.

STEP  LENGTH  SELECTION 

We mentioned in Chapter 14 that interior-point methods for linear programming are
more efficient if different step lengths $\alpha^{\rm pri}$, $\alpha^{\rm dual}$ are used for the primal and dual variables.
Equation (14.37) indicates that the greatest reduction in the residuals $r_b$ and $r_c$ is obtained
by choosing the largest admissible primal and dual step lengths. The situation is different in
quadratic programming. Suppose that we define the new iterate as

$$
(x^+, y^+) = (x, y) + \alpha^{\rm pri} (\Delta x, \Delta y), \qquad \lambda^+ = \lambda + \alpha^{\rm dual} \Delta\lambda,
\tag{16.64}
$$

where $\alpha^{\rm pri}$ and $\alpha^{\rm dual}$ are step lengths that ensure the positivity of $(y^+, \lambda^+)$. By using (16.58)
and (16.59), we see that the new residuals satisfy the following relations:

$$
\begin{aligned}
r_p^+ &= (1 - \alpha^{\rm pri})\, r_p, && \text{(16.65a)} \\
r_d^+ &= (1 - \alpha^{\rm dual})\, r_d + (\alpha^{\rm pri} - \alpha^{\rm dual})\, G \Delta x. && \text{(16.65b)}
\end{aligned}
$$

If $\alpha^{\rm pri} = \alpha^{\rm dual} = \alpha$ then both residuals decrease linearly for all $\alpha \in (0, 1)$. For different step
lengths, however, the dual residual $r_d^+$ may increase for certain choices of $\alpha^{\rm pri}$, $\alpha^{\rm dual}$, possibly
causing divergence of the interior-point iteration.
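The relation (16.65b) follows in a few lines from the first block row of (16.58), $G \Delta x - A^T \Delta\lambda = -r_d$. The derivation below is a worked check using only the definitions in (16.59) and (16.64):

```latex
\begin{aligned}
r_d^+ &= G(x + \alpha^{\rm pri} \Delta x) - A^T(\lambda + \alpha^{\rm dual} \Delta\lambda) + c \\
      &= r_d + \alpha^{\rm pri}\, G \Delta x - \alpha^{\rm dual}\, A^T \Delta\lambda \\
      &= r_d + \alpha^{\rm pri}\, G \Delta x - \alpha^{\rm dual}\, (G \Delta x + r_d) \\
      &= (1 - \alpha^{\rm dual})\, r_d + (\alpha^{\rm pri} - \alpha^{\rm dual})\, G \Delta x .
\end{aligned}
```

The primal relation (16.65a) is obtained the same way from the second block row, $A \Delta x - \Delta y = -r_p$. The extra term $G \Delta x$ is what distinguishes the quadratic case: it vanishes when $G = 0$, which is why unequal step lengths cause no such difficulty in linear programming.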

One option is to use equal step lengths, as in (16.60), and to set $\alpha = \min(\alpha_\tau^{\rm pri}, \alpha_\tau^{\rm dual})$,
where

$$
\begin{aligned}
\alpha_\tau^{\rm pri} &= \max\{\alpha \in (0, 1] : y + \alpha \Delta y \ge (1 - \tau) y\}, && \text{(16.66a)} \\
\alpha_\tau^{\rm dual} &= \max\{\alpha \in (0, 1] : \lambda + \alpha \Delta\lambda \ge (1 - \tau) \lambda\}; && \text{(16.66b)}
\end{aligned}
$$

the parameter $\tau \in (0, 1)$ controls how far we back off from the maximum step for which the
conditions $y + \alpha \Delta y \ge 0$ and $\lambda + \alpha \Delta\lambda \ge 0$ are satisfied. Numerical experience has shown,
however, that using different step lengths in the primal and dual variables often leads to
faster convergence. One way to choose unequal step lengths is to select $(\alpha^{\rm pri}, \alpha^{\rm dual})$ so as to
(approximately) minimize the optimality measure

$$
\|G x^+ - A^T \lambda^+ + c\|_2^2 + \|A x^+ - y^+ - b\|_2^2 + (y^+)^T \lambda^+,
$$

subject to $0 \le \alpha^{\rm pri} \le \alpha_\tau^{\rm pri}$ and $0 \le \alpha^{\rm dual} \le \alpha_\tau^{\rm dual}$, where $x^+$, $y^+$, $\lambda^+$ are defined as a function
of the step lengths through (16.64).
A PRACTICAL PRIMAL-DUAL METHOD

The most popular interior-point method for convex QP is based on Mehrotra’s
predictor-corrector, originally developed for linear programming (see Section 14.2). The
extension to quadratic programming is straightforward, as we now show.

First, we compute an affine scaling step $(\Delta x^{\rm aff}, \Delta y^{\rm aff}, \Delta\lambda^{\rm aff})$ by setting $\sigma = 0$ in
(16.58). We improve upon this step by computing a corrector step, which is defined following
the same reasoning that leads to (14.31). Next, we compute the centering parameter $\sigma$ using
(14.34). The total step is obtained by solving the following system (cf. (14.35)):

$$
\begin{bmatrix}
G & 0 & -A^T \\
A & -I & 0 \\
0 & \Lambda & Y
\end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} -r_d \\ -r_p \\ -\Lambda Y e - \Delta\Lambda^{\rm aff} \Delta Y^{\rm aff} e + \sigma\mu e \end{bmatrix}.
\tag{16.67}
$$


We  now  specify  the  algorithm.  For  simplicity,  we  will  assume  in  our  description  that 
equal  step  lengths  are  used  in  the  primal  and  dual  variables  though,  as  noted  above,  unequal 
step  lengths  can  give  slightly  faster  convergence. 


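Algorithm 16.4 (Predictor-Corrector Algorithm for QP).

Compute $(x_0, y_0, \lambda_0)$ with $(y_0, \lambda_0) > 0$;
for $k = 0, 1, 2, \dots$
    Set $(x, y, \lambda) = (x_k, y_k, \lambda_k)$ and solve (16.58) with $\sigma = 0$ for
        $(\Delta x^{\rm aff}, \Delta y^{\rm aff}, \Delta\lambda^{\rm aff})$;
    Calculate $\mu = y^T \lambda / m$;
    Calculate $\hat{\alpha}_{\rm aff} = \max\{\alpha \in (0, 1] \mid (y, \lambda) + \alpha(\Delta y^{\rm aff}, \Delta\lambda^{\rm aff}) \ge 0\}$;
    Calculate $\mu_{\rm aff} = (y + \hat{\alpha}_{\rm aff} \Delta y^{\rm aff})^T (\lambda + \hat{\alpha}_{\rm aff} \Delta\lambda^{\rm aff}) / m$;
    Set centering parameter to $\sigma = (\mu_{\rm aff} / \mu)^3$;
    Solve (16.67) for $(\Delta x, \Delta y, \Delta\lambda)$;
    Choose $\tau_k \in (0, 1)$ and set $\hat{\alpha} = \min(\alpha_{\tau_k}^{\rm pri}, \alpha_{\tau_k}^{\rm dual})$ (see (16.66));
    Set $(x_{k+1}, y_{k+1}, \lambda_{k+1}) = (x_k, y_k, \lambda_k) + \hat{\alpha}(\Delta x, \Delta y, \Delta\lambda)$;
end (for)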


We can choose $\tau_k$ to approach 1 as the iterates approach the solution, to accelerate the
convergence.
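The steps of Algorithm 16.4 can be sketched for small dense problems as follows. This is an illustrative sketch, not a production solver: it forms the full matrix of (16.58) and solves it by plain Gaussian elimination, whereas a practical code would factor one of the compact forms once per iteration. The function names, the fixed value `tau = 0.995`, the start $y = \lambda = e$, and the stopping tolerance are all assumptions made here. The test problem is (16.49) written as $Ax \ge b$ with $G = 2I$ and $c = (-2, -5)^T$.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    T = [row[:] + [r] for row, r in zip(M, rhs)]       # augmented copy
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(T[i][k]))
        T[k], T[piv] = T[piv], T[k]
        for i in range(k + 1, n):
            f = T[i][k] / T[k][k]
            for j in range(k, n + 1):
                T[i][j] -= f * T[k][j]
    z = [0.0] * n
    for k in reversed(range(n)):
        s = sum(T[k][j] * z[j] for j in range(k + 1, n))
        z[k] = (T[k][n] - s) / T[k][k]
    return z


def max_step(v, dv, tau):
    """Largest a in (0, 1] with v + a*dv >= (1 - tau)*v, as in (16.66)."""
    a = 1.0
    for vi, di in zip(v, dv):
        if di < 0.0:
            a = min(a, -tau * vi / di)
    return a


def predictor_corrector_qp(G, c, A, b, x, tau=0.995, tol=1e-10, max_iter=100):
    n, m = len(c), len(b)
    y, lam = [1.0] * m, [1.0] * m              # assumed start with (y, lam) > 0
    for _ in range(max_iter):
        rd = [sum(G[i][j] * x[j] for j in range(n))
              - sum(A[k][i] * lam[k] for k in range(m)) + c[i] for i in range(n)]
        rp = [sum(A[k][j] * x[j] for j in range(n)) - y[k] - b[k] for k in range(m)]
        mu = sum(yi * li for yi, li in zip(y, lam)) / m
        if max(map(abs, rd + rp + [mu])) < tol:
            break
        # Matrix of (16.58); unknowns are ordered (dx, dy, dlam).
        N = n + 2 * m
        K = [[0.0] * N for _ in range(N)]
        for i in range(n):
            for j in range(n):
                K[i][j] = G[i][j]
            for k in range(m):
                K[i][n + m + k] = -A[k][i]     # -A^T block
        for k in range(m):
            for j in range(n):
                K[n + k][j] = A[k][j]
            K[n + k][n + k] = -1.0             # -I block
            K[n + m + k][n + k] = lam[k]       # Lambda block
            K[n + m + k][n + m + k] = y[k]     # Y block
        head = [-r for r in rd] + [-r for r in rp]
        # Predictor: affine scaling step (sigma = 0).
        d = solve(K, head + [-li * yi for li, yi in zip(lam, y)])
        dy_aff, dl_aff = d[n:n + m], d[n + m:]
        a_aff = min(max_step(y, dy_aff, 1.0), max_step(lam, dl_aff, 1.0))
        mu_aff = sum((yi + a_aff * dyi) * (li + a_aff * dli)
                     for yi, dyi, li, dli in zip(y, dy_aff, lam, dl_aff)) / m
        sigma = (mu_aff / mu) ** 3
        # Corrector: third block of the right-hand side of (16.67).
        comp = [-li * yi - dli * dyi + sigma * mu
                for li, yi, dli, dyi in zip(lam, y, dl_aff, dy_aff)]
        d = solve(K, head + comp)
        dx, dy, dl = d[:n], d[n:n + m], d[n + m:]
        a = min(max_step(y, dy, tau), max_step(lam, dl, tau))
        x = [xi + a * di for xi, di in zip(x, dx)]
        y = [yi + a * di for yi, di in zip(y, dy)]
        lam = [li + a * di for li, di in zip(lam, dl)]
    return x, y, lam


# Problem (16.49): q(x) = (x1 - 1)^2 + (x2 - 2.5)^2, constraints as A x >= b.
G = [[2.0, 0.0], [0.0, 2.0]]
c = [-2.0, -5.0]
A = [[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0], [1.0, 0.0], [0.0, 1.0]]
b = [-2.0, -6.0, -2.0, 0.0, 0.0]
x_sol, y_sol, lam_sol = predictor_corrector_qp(G, c, A, b, [2.0, 0.0])
```

The computed `x_sol` and the first entry of `lam_sol` can be compared with the solution $x^* = (1.4, 1.7)^T$ and multiplier 0.8 obtained by the active-set method in Example 16.4.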

As for linear programming, the efficiency and robustness of this approach are greatly
enhanced if we choose a good starting point. This selection can be done in several ways. The
following simple heuristic accepts an initial point $(\bar{x}, \bar{y}, \bar{\lambda})$ from the user and moves it far
enough away from the boundary of the region $(y, \lambda) \ge 0$ to permit the algorithm to take
long steps on early iterations. First, we compute the affine scaling step $(\Delta x^{\rm aff}, \Delta y^{\rm aff}, \Delta\lambda^{\rm aff})$
from the user-supplied initial point $(\bar{x}, \bar{y}, \bar{\lambda})$, then set

$$
y_0 = \max(1, |\bar{y} + \Delta y^{\rm aff}|), \qquad \lambda_0 = \max(1, |\bar{\lambda} + \Delta\lambda^{\rm aff}|), \qquad x_0 = \bar{x},
$$

where the max and absolute values are applied component-wise.
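The component-wise rule above is easy to state in code. In this sketch the function name `shift_start` is hypothetical, and the affine scaling step is assumed to have been computed already from (16.58) with $\sigma = 0$.

```python
def shift_start(y_bar, lam_bar, dy_aff, dlam_aff):
    """Move (y, lambda) away from the boundary: max(1, |v + dv|) component-wise."""
    y0 = [max(1.0, abs(yi + di)) for yi, di in zip(y_bar, dy_aff)]
    lam0 = [max(1.0, abs(li + di)) for li, di in zip(lam_bar, dlam_aff)]
    return y0, lam0


# Made-up numbers: small or negative updated components are lifted to at least 1.
y0, lam0 = shift_start([0.5, 3.0], [2.0, 0.1], [-0.2, -6.0], [1.0, 0.2])
```

Every component of the result is at least 1, so the first iterations are not forced to take tiny steps by the positivity requirement on $(y, \lambda)$.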

We  conclude  this  section  by  contrasting  some  of  the  properties  of  active-set  and 
interior-point  methods  for  convex  quadratic  programming.  Active-set  methods  generally 
require  a  large  number  of  steps  in  which  each  search  direction  is  relatively  inexpensive  to 
compute,  while  interior-point  methods  take  a  smaller  number  of  more  expensive  steps. 
Active-set  methods  are  more  complicated  to  implement,  particularly  if  the  procedures  for 
updating  matrix  factorizations  try  to  take  advantage  of  sparsity  or  structure  in  G  and  A.  By 
contrast,  the  nonzero  structure  of  the  matrix  to  be  factored  at  each  interior-point  iteration 
remains  the  same  at  all  iterations  (though  the  numerical  values  change),  so  standard  sparse 
factorization  software  can  be  used  to  obtain  the  steps.  For  particular  sparsity  structures  (for 
example,  bandedness  in  the  matrices  A  and  G),  efficient  customized  solvers  for  the  linear 
system  arising  at  each  interior-point  iteration  can  be  devised. 

For  very  large  problems,  interior-point  methods  are  often  more  efficient.  However, 
when  an  estimate  of  the  solution  is  available  (a  “warm  start”),  the  active-set  approach  may 
converge  rapidly  in  just  a  few  iterations,  particularly  if  the  initial  value  of  x  is  feasible. 
Interior-point  methods  are  less  able  to  exploit  a  warm  start,  though  research  efforts  to 
improve  their  performance  in  this  regard  are  ongoing. 
In the active-set method described in Section 16.5, the active set and working set change
slowly, usually by a single index at each iteration. This method may thus require many
iterations to converge on large-scale problems. For instance, if the starting point $x^0$ has no
active constraints, while 200 constraints are active at the (nondegenerate) solution, then at
least 200 iterations of the active-set method will be required to reach the solution.

The  gradient  projection  method  allows  the  active  set  to  change  rapidly  from  iteration 
to  iteration.  It  is  most  efficient  when  the  constraints  are  simple  in  form — in  particular, 
when  there  are  only  bounds  on  the  variables.  Accordingly,  we  restrict  our  attention  to  the 
following  bound-constrained  problem: 

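$$
\begin{aligned}
\min_x \quad & q(x) = \tfrac{1}{2} x^T G x + x^T c && \text{(16.68a)} \\
\text{subject to} \quad & l \le x \le u, && \text{(16.68b)}
\end{aligned}
$$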


where $G$ is symmetric and $l$ and $u$ are vectors of lower and upper bounds on the components
of $x$. We do not make any positive definiteness assumptions on $G$ in this section, because the
gradient projection approach can be applied to both convex and nonconvex problems. The
feasible region defined by (16.68b) is sometimes called a “box” because of its rectangular
shape. Some components of $x$ may lack an upper or a lower bound; we handle these cases
formally by setting the appropriate components of $l$ and $u$ to $-\infty$ and $+\infty$, respectively.

Each  iteration  of  the  gradient  projection  algorithm  consists  of  two  stages.  In  the  first 
stage,  we  search  along  the  steepest  descent  direction  from  the  current  point  x,  that  is,  the 
direction  —g,  where  g  =  Gx  +  c;  see  (16.6).  Whenever  a  bound  is  encountered,  the  search 
direction  is  “bent”  so  that  it  stays  feasible.  We  search  along  the  resulting  piecewise-linear 
path and locate the first local minimizer of q, which we denote by x^c and refer to as the
Cauchy point, by analogy with our terminology of Chapter 4. The working set is now defined
to be the set of bound constraints that are active at the Cauchy point, denoted by A(x^c). In
the second stage of each gradient projection iteration, we explore the face of the feasible box
on which the Cauchy point lies by solving a subproblem in which the active components x_i
for i ∈ A(x^c) are fixed at the values x^c_i.

We  describe  the  gradient  projection  method  in  detail  in  the  rest  of  this  section.  Our 
convention in this section is to denote the iteration number by a superscript (that is, x^k)
and  use  subscripts  to  denote  the  elements  of  a  vector. 


CAUCHY  POINT  COMPUTATION 

We now derive an explicit expression for the piecewise-linear path obtained by projecting
the steepest descent direction onto the feasible box, and outline the search procedure
for  identifying  the  first  local  minimum  of  q  along  this  path. 

The projection of an arbitrary point x onto the feasible region (16.68b) is defined as
follows. The ith component is given by

P(x, l, u)_i = \begin{cases} l_i & \text{if } x_i < l_i, \\ x_i & \text{if } x_i \in [l_i, u_i], \\ u_i & \text{if } x_i > u_i. \end{cases}    (16.69)


(We assume, without loss of generality, that l_i < u_i for all i.) The piecewise-linear path x(t)
starting at the reference point x and obtained by projecting the steepest descent direction at
x onto the feasible region (16.68b) is thus given by

x(t) = P(x - t g, l, u),    (16.70)

where g = Gx + c; see Figure 16.4.
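
As a minimal illustration of (16.69) and (16.70), the following Python sketch evaluates the projection and the projected steepest descent path componentwise. The function names `project` and `projected_path` are our own choices for this sketch, vectors are assumed to be plain Python lists, and `math.inf` stands for a missing bound.

```python
import math

def project(x, l, u):
    """Componentwise projection P(x, l, u) of (16.69); assumes l[i] <= u[i]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, l, u)]

def projected_path(x, g, l, u, t):
    """Point x(t) = P(x - t*g, l, u) on the piecewise-linear path (16.70)."""
    return project([xi - t * gi for xi, gi in zip(x, g)], l, u)

# Illustrative data (hypothetical): the second variable has no upper bound.
x = [0.0, 0.0]
g = [-2.0, -2.0]
l = [0.0, 0.0]
u = [1.0, math.inf]
print(projected_path(x, g, l, u, 0.25))  # both components are still free
print(projected_path(x, g, l, u, 1.0))   # first component is held at its upper bound
```

For t = 0.25 neither bound has been reached, so x(t) is simply x - tg; for t = 1 the first component has been "bent" onto its upper bound while the second continues to move.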

The Cauchy point x^c is defined as the first local minimizer of the univariate, piecewise-quadratic
function q(x(t)), for t ≥ 0. This minimizer is obtained by examining each of the
line segments that make up x(t). To perform this search, we need to determine the values of
t at which the kinks in x(t), or breakpoints, occur. We first identify the values of t for which
each component reaches its bound along the chosen direction -g. These values t̄_i are given
by the following explicit formulae:


\bar{t}_i = \begin{cases} (x_i - u_i)/g_i & \text{if } g_i < 0 \text{ and } u_i < +\infty, \\ (x_i - l_i)/g_i & \text{if } g_i > 0 \text{ and } l_i > -\infty, \\ \infty & \text{otherwise.} \end{cases}    (16.71)

The components of x(t) for any t are therefore

x_i(t) = \begin{cases} x_i - t g_i & \text{if } t \le \bar{t}_i, \\ x_i - \bar{t}_i g_i & \text{otherwise.} \end{cases}

To search for the first local minimizer along P(x - tg, l, u), we eliminate the duplicate
values and zero values of t̄_i from the set {t̄_1, t̄_2, ..., t̄_n}, to obtain a sorted, reduced set
of breakpoints {t_1, t_2, ..., t_l} with 0 < t_1 < t_2 < .... We now examine the intervals
[0, t_1], [t_1, t_2], [t_2, t_3], ... in turn. Suppose we have examined up to t_{j-1} and have not yet
found a local minimizer. For the interval [t_{j-1}, t_j], we have that

x(t) = x(t_{j-1}) + (\Delta t) p^{j-1},

where

\Delta t = t - t_{j-1} \in [0, t_j - t_{j-1}],

and

p_i^{j-1} = \begin{cases} -g_i & \text{if } t_{j-1} < \bar{t}_i, \\ 0 & \text{otherwise.} \end{cases}    (16.72)


We can then write the quadratic (16.68a) on the line segment [x(t_{j-1}), x(t_j)] as
follows:

q(x(t)) = c^T (x(t_{j-1}) + (\Delta t) p^{j-1}) + \tfrac{1}{2} (x(t_{j-1}) + (\Delta t) p^{j-1})^T G (x(t_{j-1}) + (\Delta t) p^{j-1}).

Expanding and grouping the coefficients of 1, Δt, and (Δt)^2, we find that

q(x(t)) = f_{j-1} + f'_{j-1} \Delta t + \tfrac{1}{2} f''_{j-1} (\Delta t)^2, \quad \Delta t \in [0, t_j - t_{j-1}],    (16.73)

where the coefficients f_{j-1}, f'_{j-1}, and f''_{j-1} are defined by

f_{j-1} = c^T x(t_{j-1}) + \tfrac{1}{2} x(t_{j-1})^T G x(t_{j-1}),

f'_{j-1} = c^T p^{j-1} + x(t_{j-1})^T G p^{j-1},

f''_{j-1} = (p^{j-1})^T G p^{j-1}.

Differentiating (16.73) with respect to Δt and equating to zero, we obtain
Δt* = -f'_{j-1} / f''_{j-1}. The following cases can occur. (i) If f'_{j-1} > 0 there is a local minimizer
of q(x(t)) at t = t_{j-1}; else (ii) if Δt* ∈ [0, t_j - t_{j-1}) there is a minimizer at t = t_{j-1} + Δt*;
(iii) in all other cases we move on to the next interval [t_j, t_{j+1}] and continue the
search.

For the next search interval, we need to calculate the new direction p^j from (16.72),
and we use this new value to calculate f_j, f'_j, and f''_j. Since p^j differs from p^{j-1} typically
in  just  one  component,  computational  savings  can  be  made  by  updating  these  coefficients 
rather  than  computing  them  from  scratch. 
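
The search just described can be sketched in Python as follows. This is an illustration under simplifying assumptions of ours, not a tuned implementation: G is a dense list of rows, the coefficients f' and f'' are recomputed from scratch on every interval rather than updated, case (ii) is applied only when f'' > 0, degenerate flat segments receive no special treatment, and the name `cauchy_point` is hypothetical.

```python
import math

def cauchy_point(G, c, x, l, u):
    """First local minimizer of q along x(t) = P(x - t*g, l, u), for a feasible x."""
    n = len(x)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    matvec = lambda v: [dot(row, v) for row in G]
    g = [gx + ci for gx, ci in zip(matvec(x), c)]

    # Breakpoints (16.71): the value of t at which component i reaches a bound.
    tbar = []
    for i in range(n):
        if g[i] < 0 and u[i] < math.inf:
            tbar.append((x[i] - u[i]) / g[i])
        elif g[i] > 0 and l[i] > -math.inf:
            tbar.append((x[i] - l[i]) / g[i])
        else:
            tbar.append(math.inf)

    # Sorted, reduced breakpoint set, followed by a final unbounded interval.
    breaks = sorted({t for t in tbar if 0 < t < math.inf}) + [math.inf]

    t_prev, xt = 0.0, list(x)
    for t_next in breaks:
        # Search direction (16.72) on the interval [t_prev, t_next].
        p = [-g[i] if t_prev < tbar[i] else 0.0 for i in range(n)]
        if not any(p):
            return xt                      # every component is fixed at a bound
        Gp = matvec(p)
        fp = dot(c, p) + dot(xt, Gp)       # coefficient f' in (16.73)
        fpp = dot(p, Gp)                   # coefficient f'' in (16.73)
        if fp > 0:                         # case (i)
            return xt
        if fpp > 0 and -fp / fpp < t_next - t_prev:   # case (ii)
            dt = -fp / fpp
            return [xi + dt * pi for xi, pi in zip(xt, p)]
        if t_next == math.inf:
            raise ValueError("q is not bounded below along the projected path")
        # Case (iii): move to the next breakpoint and continue the search.
        xt = [min(max(x[i] - t_next * g[i], l[i]), u[i]) for i in range(n)]
        t_prev = t_next

# Hypothetical data: G = I, c = (-2, -2), box [0, 1] x [0, 3], start at the origin.
print(cauchy_point([[1.0, 0.0], [0.0, 1.0]], [-2.0, -2.0],
                   [0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

In this small example the breakpoints are 0.5 and 1.5. No minimizer is found on [0, 0.5], the first component is then fixed at its upper bound, and the search on [0.5, 1.5] stops at the point (1, 2).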


SUBSPACE  MINIMIZATION 

After the Cauchy point x^c has been computed, the components of x^c that are at their
lower or upper bounds define the active set

A(x^c) = \{ i \mid x_i^c = l_i \text{ or } x_i^c = u_i \}.

In the second stage of the gradient projection iteration, we approximately solve the QP
obtained by fixing the components x_i for i ∈ A(x^c) at the values x^c_i. The remaining
components are determined from the subproblem

\min_x \; q(x) = \tfrac{1}{2} x^T G x + x^T c    (16.74a)

\text{subject to } \; x_i = x_i^c, \quad i \in A(x^c),    (16.74b)

l_i \le x_i \le u_i, \quad i \notin A(x^c).    (16.74c)

It is not necessary to solve this problem exactly. Nor is it desirable in the large-dimensional
case, because the subproblem may be almost as difficult as the original problem (16.68).
In fact, to obtain global convergence of the gradient projection procedure, we require only
that the approximate solution x^+ of (16.74) is feasible with respect to (16.68b) and has an
objective function value no worse than that of x^c, that is, q(x^+) ≤ q(x^c). A strategy that is
intermediate between choosing x^+ = x^c as the approximate solution (on the one hand) and
solving (16.74) exactly (on the other hand) is to compute an approximate solution of (16.74)
by using the conjugate gradient iteration described in Algorithm 16.1 or Algorithm 16.2.
Note that for the equality constraints (16.74b), the Jacobian A and the null-space basis
matrix Z have particularly simple forms. We could therefore apply conjugate gradient to
the problem (16.74a), (16.74b) and terminate as soon as a bound l ≤ x ≤ u is encountered.
Alternatively, we could continue to iterate, temporarily ignoring the bounds and projecting
the solution back onto the box constraints. The negative-curvature case can be handled
as in Algorithm 7.2, the method for approximately solving possibly indefinite trust-region
subproblems in unconstrained optimization.

We  summarize  the  gradient  projection  algorithm  for  quadratic  programming  as 
follows. 

Algorithm  16.5  (Gradient  Projection  Method  for  QP). 

Compute a feasible starting point x^0;

for k = 0, 1, 2, ...

    if x^k satisfies the KKT conditions for (16.68)
        stop with solution x* = x^k;

    Set x = x^k and find the Cauchy point x^c;

    Find an approximate solution x^+ of (16.74) such that q(x^+) ≤ q(x^c)
    and x^+ is feasible;

    x^{k+1} ← x^+;

end  (for) 
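
The KKT test in Algorithm 16.5 is simple to carry out for the bound-constrained problem (16.68): a feasible point x satisfies the KKT conditions exactly when x = P(x - g, l, u) with g = Gx + c, that is, when g_i ≥ 0 at an active lower bound, g_i ≤ 0 at an active upper bound, and g_i = 0 for the remaining components. The Python sketch below applies this test up to a tolerance. The name `is_kkt_point` and the tolerance `tol` are assumptions of this illustration and are not part of the algorithm statement.

```python
def is_kkt_point(G, c, x, l, u, tol=1e-8):
    """Check x = P(x - g, l, u) componentwise, where g = G x + c; x must be feasible."""
    for i, xi in enumerate(x):
        gi = sum(Gij * xj for Gij, xj in zip(G[i], x)) + c[i]
        projected = min(max(xi - gi, l[i]), u[i])
        if abs(projected - xi) > tol:
            return False
    return True

# Hypothetical data: G = I, c = (-2, -2), box [0, 1] x [0, 3].
G = [[1.0, 0.0], [0.0, 1.0]]
c = [-2.0, -2.0]
l, u = [0.0, 0.0], [1.0, 3.0]
print(is_kkt_point(G, c, [0.0, 0.0], l, u))  # gradient points into the box: not a KKT point
print(is_kkt_point(G, c, [1.0, 2.0], l, u))  # upper bound on the first variable is active
```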

If the algorithm approaches a solution x* at which the Lagrange multipliers associated
with all the active bounds are nonzero (that is, strict complementarity holds), the active sets
A(x^c) generated by the gradient projection algorithm are equal to the optimal active set for
all k sufficiently large. That is, constraint indices do not repeatedly enter and leave the active
set on successive iterations. When the problem is degenerate, the active set may not settle
down at its optimal value. Various devices have been proposed to prevent this undesirable
behavior from taking place.

While gradient projection methods can be applied in principle to problems with general
linear constraints, significant computation may be required to perform the projection
onto the feasible set in such cases. For example, if the constraint set is defined as a_i^T x ≥ b_i,
i ∈ I, we must solve the following convex quadratic program to compute the projection of
a given point x̄ onto this set:

\min_x \; \|x - \bar{x}\|^2 \quad \text{subject to } a_i^T x \ge b_i \text{ for all } i \in \mathcal{I}.

The  expense  of  solving  this  “projection  subproblem”  may  approach  the  cost  of  solving  the 
original  quadratic  program,  so  it  is  usually  not  economical  to  apply  gradient  projection  to 
this  case. 

When we use duality to replace a strictly convex quadratic program with its dual
(see Example 12.12), the gradient projection method may be useful in solving the bound-constrained
dual problem, which is formulated in terms of the Lagrange multipliers λ as
follows:

\max_\lambda \; \tilde{q}(\lambda) = -\tfrac{1}{2} (A^T \lambda - c)^T G^{-1} (A^T \lambda - c) + b^T \lambda, \quad \text{subject to } \lambda \ge 0.

(Note that the dual is conventionally written as a maximization problem; we can equivalently
minimize the negative of the dual objective and note that this transformed problem is convex.) This approach is most
useful when G has a simple form, for example, a diagonal or block-diagonal matrix.


16.8 PERSPECTIVES AND SOFTWARE


Active-set methods for convex quadratic programming are implemented in QPOPT [126],
VE09 [142], BQPD [103], and QPA [148]. Several commercial interior-point solvers for QP
are available, including CPLEX [172], XPRESS-MP [159] and MOSEK [5]. The code QPB [146]
uses a two-phase interior-point method that can handle convex and nonconvex problems.
OOPS [139] and OOQP [121] are object-oriented interior-point codes that allow the user
to customize the linear algebra techniques to the particular structure of the data for an
application. Some nonlinear programming interior-point packages, such as LOQO [294] and
KNITRO [46], are also effective for convex and nonconvex quadratic programming.

The numerical comparison of active-set and interior-point methods for convex
quadratic programming reported in [149] indicates that interior-point methods are generally
much faster on large problems. If a warm start is required, however, active-set methods
may be generally preferable. Although considerable research has been focused on improving
the warm-start capabilities of interior-point methods, the full potential of such techniques
is not yet known.

We have assumed in this chapter that all equality-constrained quadratic programs have
linearly independent constraints, that is, the m × n constraint Jacobian matrix A has rank m.
If redundant constraints are present, they can be detected by forming an SVD or rank-revealing
QR factorization of A^T, and then removed from the formulation. When A is large, sparse
Gaussian elimination techniques can be applied to A^T instead, but they are less reliable.

The KNITRO and OOPS software packages provide the option of solving the primal-dual
equations (16.63) by means of the projected CG iteration of Algorithm 16.2.

We  have  not  considered  active-set  methods  for  the  case  in  which  the  Hessian  matrix 
G  is  indefinite  because  these  methods  can  be  quite  complicated  to  describe  and  it  is  not  well 
understood  how  to  adapt  them  to  the  large  dimensional  case.  We  make  some  comments 
here  on  the  principal  techniques. 

Algorithm  16.3,  the  active-set  method  for  convex  QP,  can  be  adapted  to  this  indefinite 
case  by  modifying  the  computation  of  the  search  direction  and  step  length  in  certain 
situations.  To  explain  the  need  for  the  modification,  we  consider  the  computation  of  a 
step by a null-space method, that is, p = Z p_z, where p_z is given by (16.53). If the reduced
Hessian Z^T G Z is positive definite, then this step p points to the minimizer of the subproblem
(16.39), and the logic of the iteration need not be changed. If Z^T G Z has negative eigenvalues,
however, p points only to a saddle point of (16.39) and is therefore not always a suitable
step. Instead, we seek an alternative direction s_z that is a direction of negative curvature for
Z^T G Z. We then have that

q(x + \alpha Z s_z) \to -\infty \quad \text{as } \alpha \to \infty.    (16.75)

Additionally, we change the sign of s_z if necessary to ensure that Z s_z is a non-ascent direction
for q at the current point x, that is, ∇q(x)^T Z s_z ≤ 0. By moving along the direction Z s_z, we
will  encounter  a  constraint  that  can  be  added  to  the  working  set  for  the  next  iteration.  (If  we 
don’t  find  such  a  constraint,  the  problem  is  unbounded.)  If  the  reduced  Hessian  for  the  new 
working  set  is  not  positive  definite,  we  repeat  this  process  until  enough  constraints  have  been 
added  to  make  the  reduced  Hessian  positive  definite.  A  difficulty  with  this  general  approach, 
however,  is  that  if  we  allow  the  reduced  Hessian  to  have  several  negative  eigenvalues,  it 
is  difficult  to  make  these  methods  efficient  when  the  reduced  Hessian  changes  from  one 
working  set  to  the  next. 

Inertia  controlling  methods  are  a  practical  class  of  algorithms  for  indefinite  QP  that 
never  allow  the  reduced  Hessian  to  have  more  than  one  negative  eigenvalue.  As  in  the  convex 
case, there is a preliminary phase in which a feasible starting point x_0 is found. We place
the additional demand on x_0 that it be either a vertex (in which case the reduced Hessian is
the  null  matrix)  or  a  constrained  stationary  point  at  which  the  reduced  Hessian  is  positive 
definite.  At  each  iteration,  the  algorithm  will  either  add  or  remove  a  constraint  from  the 
working  set.  If  a  constraint  is  added,  the  reduced  Hessian  is  of  smaller  dimension  and  must 
remain  positive  definite  or  be  the  null  matrix.  Therefore,  an  indefinite  reduced  Hessian  can 
arise  only  when  one  of  the  constraints  is  removed  from  the  working  set,  which  happens 
only  when  the  current  point  is  a  minimizer  with  respect  to  the  current  working  set.  In  this 
case,  we  will  choose  the  new  search  direction  to  be  a  direction  of  negative  curvature  for  the 
reduced  Hessian. 

Various algorithms for indefinite QP differ in the way that indefiniteness is detected,
in  the  computation  of  the  negative  curvature  direction,  and  in  the  handling  of  the  working 
set;  see  Fletcher  [99]  and  Gill  and  Murray  [126]. 

NOTES  AND  REFERENCES 

The  problem  of  determining  whether  a  feasible  point  for  a  nonconvex  QP  (16.1)  is  a 
global  minimizer  is  NP-hard  (Murty  and  Kabadi  [219]);  so  is  the  problem  of  determining 
whether  a  given  point  is  a  local  minimizer  (Vavasis  [296,  Theorem  5.1]).  Various  algorithms 
for convex QP with polynomial complexity are discussed in Nesterov and Nemirovskii [226].

The  portfolio  optimization  problem  was  formulated  by  Markowitz  [201]. 

For  a  discussion  on  the  QMR,  LSQR,  and  GMRES  methods  see,  for  example,  [136, 
272,  290].  The  idea  of  using  the  projection  (16.30)  in  the  CG  method  dates  back  to  at 
least  Polyak  [238].  The  alternative  (16.34),  and  its  special  case  (16.32),  are  proposed  in 
Coleman [64]. Although this approach can give rise to substantial rounding errors, they can be corrected
by iterative refinement; see Gould et al. [143]. More recent studies on preconditioning of
the projected CG method include Keller et al. [176] and Lukšan and Vlček [196].

For  further  discussion  on  the  gradient  projection  method  see,  for  example,  Conn, 
Gould, and Toint [70] and Burke and Moré [44].

In some areas of application, the KKT matrix (16.7) not only is sparse but also contains
special  structure.  For  instance,  the  quadratic  programs  that  arise  in  many  control  problems 
have  banded  matrices  G  and  A  (see  Wright  [315]),  which  can  be  exploited  by  interior-point 
methods  via  a  suitable  symmetric  reordering  of  K.  When  active-set  methods  are  applied  to 
this  problem,  however,  the  advantages  of  bandedness  and  sparsity  are  lost  after  just  a  few 
updates  of  the  factorization. 

Further  details  of  interior-point  methods  for  convex  quadratic  programming  can 
be  found  in  Wright  [316]  and  Vanderbei  [293].  The  first  inertia-controlling  method  for 
indefinite  quadratic  programming  was  proposed  by  Fletcher  [99].  See  also  Gill  et  al.  [129] 
and  Gould  [142]  for  a  discussion  of  methods  for  general  quadratic  programming. 


EXERCISES

16.1

(a) Solve the following quadratic program and illustrate it geometrically.

\min \; f(x) = 2x_1 + 3x_2 + 4x_1^2 + 2x_1 x_2 + x_2^2,
\text{subject to } \; x_1 - x_2 \ge 0, \quad x_1 + x_2 \le 4, \quad x_1 \le 3.

(b) If the objective function is redefined as q(x) = -f(x), does the problem have a finite
minimum? Are there local minimizers?

16.2 The problem of finding the shortest distance from a point x_0 to the hyperplane
{x | Ax = b}, where A has full row rank, can be formulated as the quadratic program

\min \; \tfrac{1}{2} (x - x_0)^T (x - x_0) \quad \text{subject to } Ax = b.

Show that the optimal multiplier is

\lambda^* = (A A^T)^{-1} (b - A x_0)

and that the solution is

x^* = x_0 + A^T (A A^T)^{-1} (b - A x_0).

Show that in the special case in which A is a row vector, the shortest distance from x_0 to the
solution set of Ax = b is |b - A x_0| / ||A||_2.

16.3 Use Theorem 12.1 to verify that the first-order necessary conditions for (16.3)
are given by (16.4).

16.4 Suppose that G is positive semidefinite in (16.1) and that x* satisfies the KKT
conditions (16.37) for some λ*_i, i ∈ A(x*). Suppose in addition that second-order sufficient
conditions are satisfied, that is, Z^T G Z is positive definite where the columns of Z span the
null space of the active constraint Jacobian matrix. Show that x* is in fact the unique global
solution for (16.1), that is, q(x) > q(x*) for all feasible x with x ≠ x*.

16.5 Verify that the inverse of the KKT matrix is given by (16.16).

16.6 Use Theorem 12.6 to show that if the conditions of Lemma 16.1 hold, then the
second-order sufficient conditions for (16.3) are satisfied by the vector pair (x*, λ*) that
satisfies (16.4).

16.7 Consider (16.3) and suppose that the projected Hessian matrix Z^T G Z has a
negative eigenvalue; that is, u^T Z^T G Z u < 0 for some vector u. Show that if there exists any
vector pair (x*, λ*) that satisfies (16.4), then the point x* is only a stationary point of (16.3)
and not a local minimizer. (Hint: Consider the function q(x* + αZu) for α ≠ 0, and use
an expansion like that in the proof of Theorem 16.2.)

16.8 By using the QR factorization and a permutation matrix, show that for a full-rank
m × n matrix A (with m < n) one can find an orthogonal matrix Q and an m × m upper
triangular matrix U such that AQ = [0  U]. (Hint: Start by applying the standard QR
factorization to A^T.)

16.9 Verify that the first-order conditions for optimality of (16.1) are equivalent to
(16.37) when we make use of the active-set definition (16.36).

16.10 For each of the alternative choices of initial working set W_0 in the example
(16.49) (that is, W_0 = {3}, W_0 = {5}, and W_0 = ∅) work through the first two iterations of
Algorithm 16.3.

16.11 Program Algorithm 16.3, and use it to solve the problem

\min \; x_1^2 + 2x_2^2 - 2x_1 - 6x_2 - 2x_1 x_2
\text{subject to } \; \tfrac{1}{2} x_1 + \tfrac{1}{2} x_2 \le 1, \quad -x_1 + 2x_2 \le 2, \quad x_1, x_2 \ge 0.

Choose three initial starting points: one in the interior of the feasible region, one at a vertex,
and one at a non-vertex point on the boundary of the feasible region.

16.12 Show that the operator P defined by (16.27) is independent of the choice of
null-space basis Z. (Hint: First show that any null-space basis Z can be written as Z = QB
where Q is an orthogonal basis and B is a nonsingular matrix.)

16.13

(a) Show that the computation of the preconditioned residual g^+ in (16.28d) can be
performed with (16.29) or (16.30).

(b) Show that we can also perform this computation by solving the system (16.32).

(c) Verify (16.33).

16.14

(a) Show that if Z^T G Z is positive definite, then the denominator in (16.28a) is nonzero.

(b) Show that if Z^T r = r_z ≠ 0 and Z^T H Z is positive definite, then the denominator in
(16.28e) is nonzero.

16.15 Consider problem (16.3), and assume that A has full row rank and that Z is a
basis for the null space of A. Prove that there are no finite solutions if Z^T G Z has negative
eigenvalues.

16.16

(a) Assume that A ≠ 0. Show that the KKT matrix (16.7) is indefinite.

(b) Prove that if the KKT matrix (16.7) is nonsingular, then A must have full rank.

16.17 Consider the quadratic program

\max \; 6x_1 + 4x_2 - 13 - x_1^2 - x_2^2,
\text{subject to } \; x_1 + x_2 \le 3, \quad x_1 \ge 0, \quad x_2 \ge 0.    (16.76)

First solve it graphically, and then use your program implementing the active-set method
given in Algorithm 16.3.

16.18 Using (16.39) and (16.41), explain briefly why the gradient of each blocking
constraint cannot be a linear combination of the constraint gradients in the current working
set W_k.

16.19 Let W be an n × n symmetric matrix, and suppose that Z is of dimension n × t.
Suppose that Z^T W Z is positive definite and that Z̄ is obtained by removing a column from
Z. Show that Z̄^T W Z̄ is positive definite.

16.20 Find a null-space basis matrix Z for the equality-constrained problem defined
by (16.74a), (16.74b).

16.21 Write down KKT conditions for the following convex quadratic program with
mixed equality and inequality constraints:

\min \; q(x) = \tfrac{1}{2} x^T G x + x^T c \quad \text{subject to } Ax \ge b, \; \bar{A} x = \bar{b},

where G is symmetric and positive semidefinite. Use these conditions to derive an analogue
of the generic primal-dual step (16.58) for this problem.

16.22 Explain why for a bound-constrained problem the number of possible active
sets is at most 3^n.

16.23

(a) Show that the primal-dual system (16.58) can be solved using the augmented system
(16.61) or the normal equations (16.62). Describe in detail how all the components
(Δx, Δy, Δλ) are computed.

(b) Verify (16.65).

16.24 Program Algorithm 16.4 and use it to solve problem (16.76). Set all initial
variables to be the vector e = (1, 1, ..., 1)^T.

16.25 Let x̄ ∈ R^n be given, and let x* be the solution of the projection problem

\min \; \|x - \bar{x}\|^2 \quad \text{subject to } l \le x \le u.    (16.77)

For simplicity, assume that -∞ < l_i < u_i < ∞ for all i = 1, 2, ..., n. Show that the
solution of this problem coincides with the projection formula given by (16.69); that is, show
that x* = P(x̄, l, u). (Hint: Note that the problem is separable.)

16.26 Consider the bound-constrained quadratic problem (16.68) with

G = \begin{bmatrix} 4 & 1 \\ 1 & 2 \end{bmatrix}, \quad c = \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \quad l = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \text{and} \quad u = \begin{bmatrix} 5 \\ 3 \end{bmatrix}.    (16.78)

Suppose x^0 = (0, 2)^T. Find t̄_1, t̄_2, t_1, t_2, p^1, p^2 and x(t_1), x(t_2). Find the minimizer of
q(x(t)).

16.27 Consider the search for the one-dimensional minimizer of the function q(x(t))
defined by (16.73). There are 9 possible cases since f, f', f'' can each be positive, negative, or
zero. For each case, determine the location of the minimizer. Verify that the rules described
in Section 16.7 hold.


Chapter 17

Penalty and Augmented Lagrangian Methods


Some  important  methods  for  constrained  optimization  replace  the  original  problem  by  a 
sequence  of  subproblems  in  which  the  constraints  are  represented  by  terms  added  to  the 
objective.  In  this  chapter  we  describe  three  approaches  of  this  type.  The  quadratic  penalty 
method  adds  a  multiple  of  the  square  of  the  violation  of  each  constraint  to  the  objective. 
Because of its simplicity and intuitive appeal, this approach is used often in practice,
although it has some important disadvantages. In nonsmooth exact penalty methods, a single
unconstrained problem (rather than a sequence) takes the place of the original constrained
problem. Using these penalty functions, we can often find a solution by performing a single
unconstrained minimization, but the nonsmoothness may create complications. A popular
function of this type is the ℓ_1 penalty function. A different kind of exact penalty approach
is  the  method  of  multipliers  or  augmented  Lagrangian  method,  in  which  explicit  Lagrange 
multiplier  estimates  are  used  to  avoid  the  ill-conditioning  that  is  inherent  in  the  quadratic 
penalty  function. 

A  somewhat  related  approach  is  used  in  the  log-barrier  method,  in  which  logarithmic 
terms prevent feasible iterates from moving too close to the boundary of the feasible
region. This approach forms part of the foundation for interior-point methods for nonlinear
programming  and  we  discuss  it  further  in  Chapter  19. 


17.1 THE QUADRATIC PENALTY METHOD

MOTIVATION 

Let  us  consider  replacing  a  constrained  optimization  problem  by  a  single  function 
consisting  of 

-  the  original  objective  of  the  constrained  optimization  problem,  plus 

-  one  additional  term  for  each  constraint,  which  is  positive  when  the  current  point  x 
violates  that  constraint  and  zero  otherwise. 

Most  approaches  define  a  sequence  of  such  penalty  functions,  in  which  the  penalty  terms  for 
the  constraint  violations  are  multiplied  by  a  positive  coefficient.  By  making  this  coefficient 
larger,  we  penalize  constraint  violations  more  severely,  thereby  forcing  the  minimizer  of  the 
penalty  function  closer  to  the  feasible  region  for  the  constrained  problem. 

The  simplest  penalty  function  of  this  type  is  the  quadratic  penalty  function,  in  which 
the  penalty  terms  are  the  squares  of  the  constraint  violations.  We  describe  this  approach 
first  in  the  context  of  the  equality-constrained  problem 

\min_x \; f(x) \quad \text{subject to } c_i(x) = 0, \; i \in \mathcal{E},    (17.1)

which is a special case of (12.1). The quadratic penalty function Q(x; μ) for this formulation
is

Q(x; \mu) = f(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i^2(x),    (17.2)

where μ > 0 is the penalty parameter. By driving μ to ∞, we penalize the constraint
violations with increasing severity. It makes good intuitive sense to consider a sequence of
values {μ_k} with μ_k ↑ ∞ as k → ∞, and to seek the approximate minimizer x_k of Q(x; μ_k)
for each k. Because the penalty terms in (17.2) are smooth, we can use techniques from
unconstrained optimization to search for x_k. In searching for x_k, we can use the minimizers
x_{k-1}, x_{k-2}, etc., of Q(·; μ) for smaller values of μ to construct an initial guess. For suitable
choices of the sequence {μ_k} and the initial guesses, just a few steps of unconstrained
minimization may be needed for each μ_k.


EXAMPLE 17.1

Consider the problem (12.9) from Chapter 12, that is,

\min \; x_1 + x_2 \quad \text{subject to } x_1^2 + x_2^2 - 2 = 0,    (17.3)

for which the solution is (-1, -1)^T and the quadratic penalty function is

Q(x; \mu) = x_1 + x_2 + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2.    (17.4)

We plot the contours of this function in Figures 17.1 and 17.2. In Figure 17.1 we have
μ = 1, and we observe a minimizer of Q near the point (-1.1, -1.1)^T. (There is also a
local maximizer near x = (0.3, 0.3)^T.) In Figure 17.2 we have μ = 10, so points that do not
lie on the feasible circle defined by x_1^2 + x_2^2 = 2 suffer a much greater penalty than in the
first figure; the “trough” of low values of Q is clearly evident. The minimizer in this figure
is much closer to the solution (-1, -1)^T of the problem (17.3). A local maximum lies near
(0, 0)^T, and Q goes rapidly to ∞ outside the circle x_1^2 + x_2^2 = 2.
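
The minimizers quoted in Example 17.1 can be checked with a few lines of Python. Setting the gradient of (17.4) to zero gives 1 + 2μ c(x) x_i = 0 for i = 1, 2, with c(x) = x_1^2 + x_2^2 - 2, so every stationary point has x_1 = x_2 = s, where s solves 1 + 4μ s (s^2 - 1) = 0. The sketch below locates the root in (-2, -1), which corresponds to the minimizer, by bisection. The function name and the bracketing interval are choices made for this illustration.

```python
def penalty_minimizer(mu, iters=60):
    """Root s in (-2, -1) of h(s) = 1 + 4*mu*s*(s**2 - 1); the minimizer of (17.4) is (s, s)."""
    h = lambda s: 1.0 + 4.0 * mu * s * (s * s - 1.0)
    # h(-2) = 1 - 24*mu < 0 < h(-1) = 1 whenever mu > 1/24, so the bracket is valid.
    lo, hi = -2.0, -1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for mu in (1.0, 10.0, 100.0):
    s = penalty_minimizer(mu)
    print(f"mu = {mu:6.1f}  minimizer = ({s:.4f}, {s:.4f})")
```

The computed value is approximately -1.107 for μ = 1 and -1.012 for μ = 10, consistent with the two contour plots: the minimizer of Q approaches (-1, -1)^T as μ grows.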


Figure 17.1 Contours of Q(x; μ) from (17.4) for μ = 1, contour spacing 0.5.

Figure 17.2 Contours of Q(x; μ) from (17.4) for μ = 10, contour spacing 2.


The situation is not always so benign as in Example 17.1. For a given value of the
penalty parameter μ, the penalty function may be unbounded below even if the original
constrained problem has a unique solution. Consider for example

\min \; -5x_1^2 + x_2^2 \quad \text{subject to } x_1 = 1,    (17.5)

whose solution is (1, 0)^T. The penalty function is unbounded for any μ < 10. For such
values of μ, the iterates generated by an unconstrained minimization method would usually
diverge. This deficiency is, unfortunately, common to all the penalty functions discussed in
this chapter.
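
To see where the threshold μ = 10 comes from, note that for (17.5) the penalty function is Q(x; μ) = (μ/2 - 5) x_1^2 - μ x_1 + μ/2 + x_2^2, so the coefficient of x_1^2 is negative exactly when μ < 10. The short Python sketch below evaluates Q along the x_1 axis and, for μ > 10, the stationary value x_1 = μ/(μ - 10) obtained by setting the x_1 derivative to zero. The function names are our own.

```python
def Q(x1, x2, mu):
    """Quadratic penalty function for problem (17.5)."""
    return -5.0 * x1**2 + x2**2 + 0.5 * mu * (x1 - 1.0)**2

def x1_minimizer(mu):
    """Minimizer of Q in x1 (x2 = 0 is optimal); only defined for mu > 10."""
    if mu <= 10.0:
        raise ValueError("Q is unbounded below in x1 when mu <= 10")
    return mu / (mu - 10.0)

# mu = 5: Q decreases without bound as x1 grows.
print([Q(x1, 0.0, 5.0) for x1 in (1.0, 10.0, 100.0)])
# mu > 10: the minimizer approaches the constrained solution x1 = 1.
print([x1_minimizer(mu) for mu in (20.0, 100.0, 1000.0)])
```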

For the general constrained optimization problem

\[ \min_x \; f(x) \quad \text{subject to } c_i(x) = 0, \; i \in \mathcal{E}, \qquad c_i(x) \ge 0, \; i \in \mathcal{I}, \tag{17.6} \]

which contains inequality constraints as well as equality constraints, we can define the quadratic penalty function as

\[ Q(x; \mu) = f(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i^2(x) + \frac{\mu}{2} \sum_{i \in \mathcal{I}} \big([c_i(x)]^-\big)^2, \tag{17.7} \]

where $[y]^-$ denotes $\max(-y, 0)$. In this case, $Q$ may be less smooth than the objective and constraint functions. For instance, if one of the inequality constraints is $x_1 \ge 0$, then the function $\min(0, x_1)^2$ has a discontinuous second derivative, so that $Q$ is no longer twice continuously differentiable.
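As a minimal sketch of definition (17.7), the following function builds $Q(\cdot;\mu)$ from an objective and two lists of constraint functions. The name `make_quadratic_penalty` and the calling convention (constraints passed as Python callables) are assumptions made for illustration.

```python
def make_quadratic_penalty(f, eq_constraints, ineq_constraints, mu):
    """Return the function x -> Q(x; mu) of (17.7).

    eq_constraints:   callables c_i with the requirement c_i(x) = 0
    ineq_constraints: callables c_i with the requirement c_i(x) >= 0
    """
    def Q(x):
        eq_term = sum(c(x) ** 2 for c in eq_constraints)
        # [y]^- = max(-y, 0): only violated inequalities are penalized
        ineq_term = sum(max(-c(x), 0.0) ** 2 for c in ineq_constraints)
        return f(x) + 0.5 * mu * (eq_term + ineq_term)
    return Q

# Example: f = 0 and the single inequality x_1 >= 0, with mu = 2.
Q = make_quadratic_penalty(lambda x: 0.0, [], [lambda x: x[0]], mu=2.0)
print(Q([-3.0]), Q([1.0]))   # the penalty is active only for x_1 < 0
```

With $f = 0$ this $Q$ reduces to $(\mu/2)\min(0, x_1)^2$, the function whose second derivative jumps at $x_1 = 0$.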




ALGORITHMIC  FRAMEWORK 

A  general  framework  for  algorithms  based  on  the  quadratic  penalty  function  (17.2) 
can  be  specified  as  follows. 

Framework 17.1 (Quadratic Penalty Method).

Given $\mu_0 > 0$, a nonnegative sequence $\{\tau_k\}$ with $\tau_k \to 0$, and a starting point $x_0^s$;

for $k = 0, 1, 2, \ldots$

    Find an approximate minimizer $x_k$ of $Q(\cdot; \mu_k)$, starting at $x_k^s$,
    and terminating when $\|\nabla_x Q(x; \mu_k)\| \le \tau_k$;

    if final convergence test satisfied

        stop with approximate solution $x_k$;

    end (if)

    Choose new penalty parameter $\mu_{k+1} > \mu_k$;

    Choose new starting point $x_{k+1}^s$;

end (for)
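The framework can be sketched in a few lines for problem (17.3), assuming $f(x) = x_1 + x_2$ and $c(x) = x_1^2 + x_2^2 - 2$. The inner solver here is a modified Newton iteration with backtracking, which is one possible choice and not prescribed by the framework. The tolerance rule $\tau_k = 1/\mu_k$, the factor 10, the feasibility test, and all function names are illustrative assumptions.

```python
import math

def c(x):
    """Equality constraint of problem (17.3): x1^2 + x2^2 - 2 = 0."""
    return x[0] ** 2 + x[1] ** 2 - 2.0

def Q(x, mu):
    return x[0] + x[1] + 0.5 * mu * c(x) ** 2

def grad_Q(x, mu):
    s = mu * c(x)
    return [1.0 + 2.0 * s * x[0], 1.0 + 2.0 * s * x[1]]

def hess_Q(x, mu):
    """Entries (a, b, d) of the symmetric 2x2 Hessian [[a, b], [b, d]]."""
    s = mu * c(x)
    return (2.0 * s + 4.0 * mu * x[0] ** 2,
            4.0 * mu * x[0] * x[1],
            2.0 * s + 4.0 * mu * x[1] ** 2)

def minimize_Q(x, mu, tau, max_iter=200):
    """Modified Newton with backtracking; stops when ||grad Q|| <= tau."""
    for _ in range(max_iter):
        g = grad_Q(x, mu)
        if math.hypot(g[0], g[1]) <= tau:
            break
        a, b, d = hess_Q(x, mu)
        shift = 0.0
        # Add a multiple of the identity until the matrix is positive definite.
        while not (a + shift > 0.0 and (a + shift) * (d + shift) - b * b > 0.0):
            shift = max(2.0 * shift, 1e-3)
        a, d = a + shift, d + shift
        det = a * d - b * b
        p = [(-g[0] * d + g[1] * b) / det, (g[0] * b - g[1] * a) / det]
        slope = g[0] * p[0] + g[1] * p[1]
        alpha, q0 = 1.0, Q(x, mu)
        # Armijo backtracking line search.
        while (alpha > 1e-12 and
               Q([x[0] + alpha * p[0], x[1] + alpha * p[1]], mu)
               > q0 + 1e-4 * alpha * slope):
            alpha *= 0.5
        x = [x[0] + alpha * p[0], x[1] + alpha * p[1]]
    return x

def quadratic_penalty_method(x_start, mu=1.0, feas_tol=1e-4, max_outer=10):
    x = list(x_start)
    for _ in range(max_outer):
        x = minimize_Q(x, mu, tau=1.0 / mu)   # tau_k = 1/mu_k tends to 0
        if abs(c(x)) <= feas_tol:             # final convergence test
            break
        mu *= 10.0                            # mu_{k+1} > mu_k; warm start from x_k
    return x, mu

x, mu = quadratic_penalty_method([-2.0, -1.0])
print(x, mu, -mu * c(x))
```

The last printed value, $-\mu c(x)$, is the multiplier estimate that appears in the convergence analysis of this section.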

The parameter sequence $\{\mu_k\}$ can be chosen adaptively, based on the difficulty of minimizing the penalty function at each iteration. When minimization of $Q(x; \mu_k)$ proves to be expensive for some $k$, we choose $\mu_{k+1}$ to be only modestly larger than $\mu_k$; for instance $\mu_{k+1} = 1.5\mu_k$. If we find the approximate minimizer of $Q(x; \mu_k)$ cheaply, we could try a more ambitious increase, for instance $\mu_{k+1} = 10\mu_k$. The convergence theory for Framework 17.1 allows wide latitude in the choice of nonnegative tolerances $\tau_k$; it requires only that $\tau_k \to 0$, to ensure that the minimization is carried out more accurately as the iterations progress.
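One way to encode this adaptive rule is sketched below. Measuring "expensive" by the number of inner iterations, and the threshold of 10 iterations, are assumptions made for illustration; only the factors 1.5 and 10 come from the text.

```python
def next_penalty(mu, inner_iterations, cheap_threshold=10):
    """Return mu_{k+1} given mu_k and the cost of the last inner minimization.

    cheap_threshold is a hypothetical tuning parameter: a subproblem solved in
    at most this many inner iterations counts as cheap.
    """
    factor = 10.0 if inner_iterations <= cheap_threshold else 1.5
    return factor * mu

print(next_penalty(2.0, inner_iterations=3))    # cheap subproblem: ambitious increase
print(next_penalty(2.0, inner_iterations=50))   # expensive subproblem: modest increase
```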

There is no guarantee that the stop test $\|\nabla_x Q(x; \mu_k)\| \le \tau_k$ will be satisfied because, as discussed above, the iterates may move away from the feasible region when the penalty parameter is not large enough. A practical implementation must include safeguards that increase the penalty parameter (and possibly restore the initial point) when the constraint violation is not decreasing rapidly enough, or when the iterates appear to be diverging.

When only equality constraints are present, $Q(x; \mu_k)$ is smooth, so the algorithms for unconstrained minimization described in the first chapters of the book can be used to identify the approximate solution $x_k$. However, the minimization of $Q(x; \mu_k)$ becomes more difficult to perform as $\mu_k$ becomes large, unless we use special techniques to calculate the search directions. For one thing, the Hessian $\nabla_{xx}^2 Q(x; \mu_k)$ becomes arbitrarily ill conditioned near the minimizer. This property alone is enough to make many unconstrained minimization algorithms such as quasi-Newton and conjugate gradient perform poorly. Newton’s method, on the other hand, is not sensitive to ill conditioning of the Hessian, but it, too, may encounter difficulties for large $\mu_k$ for two other reasons. First, ill conditioning of $\nabla_{xx}^2 Q(x; \mu_k)$ might be expected to cause numerical problems when we solve the linear equations to calculate the Newton step. We discuss this issue below, and show that these effects are not severe and that a reformulation of the Newton equations is possible. Second, even when $x$ is close to the minimizer of $Q(\cdot; \mu_k)$, the quadratic Taylor series approximation to $Q(x; \mu_k)$ about $x$ is a reasonable approximation of the true function only in a small neighborhood of $x$. This property can be seen in Figure 17.2, where the contours of $Q$ near the minimizer have a “banana” shape, rather than the elliptical shape that characterizes quadratic functions. Since Newton’s method is based on the quadratic model, the steps that it generates may not make rapid progress toward the minimizer of $Q(x; \mu_k)$. This difficulty can be lessened by a judicious choice of the starting point $x_{k+1}^s$, or by setting $x_{k+1}^s = x_k$ and choosing $\mu_{k+1}$ to be only modestly larger than $\mu_k$.


CONVERGENCE  OF  THE  QUADRATIC  PENALTY  METHOD 

We describe some convergence properties of the quadratic penalty method in the following two theorems. We restrict our attention to the equality-constrained problem (17.1), for which the quadratic penalty function is defined by (17.2).

For the first result we assume that the penalty function $Q(x; \mu_k)$ has a (finite) minimizer for each value of $\mu_k$.

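Theorem 17.1.

Suppose that each $x_k$ is the exact global minimizer of $Q(x; \mu_k)$ defined by (17.2) in Framework 17.1 above, and that $\mu_k \uparrow \infty$. Then every limit point $x^*$ of the sequence $\{x_k\}$ is a global solution of the problem (17.1).

PROOF. Let $\bar{x}$ be a global solution of (17.1), that is,

\[ f(\bar{x}) \le f(x) \quad \text{for all } x \text{ with } c_i(x) = 0, \; i \in \mathcal{E}. \]

Since $x_k$ minimizes $Q(\cdot; \mu_k)$ for each $k$, we have that $Q(x_k; \mu_k) \le Q(\bar{x}; \mu_k)$, which leads to the inequality

\[ f(x_k) + \frac{\mu_k}{2} \sum_{i \in \mathcal{E}} c_i^2(x_k) \le f(\bar{x}) + \frac{\mu_k}{2} \sum_{i \in \mathcal{E}} c_i^2(\bar{x}) = f(\bar{x}). \tag{17.8} \]

By rearranging this expression, we obtain

\[ \sum_{i \in \mathcal{E}} c_i^2(x_k) \le \frac{2}{\mu_k} \left[ f(\bar{x}) - f(x_k) \right]. \tag{17.9} \]

Suppose that $x^*$ is a limit point of $\{x_k\}$, so that there is an infinite subsequence $\mathcal{K}$ such that

\[ \lim_{k \in \mathcal{K}} x_k = x^*. \]

By taking the limit as $k \to \infty$, $k \in \mathcal{K}$, on both sides of (17.9), we obtain

\[ \sum_{i \in \mathcal{E}} c_i^2(x^*) = \lim_{k \in \mathcal{K}} \sum_{i \in \mathcal{E}} c_i^2(x_k) \le \lim_{k \in \mathcal{K}} \frac{2}{\mu_k} \left[ f(\bar{x}) - f(x_k) \right] = 0, \]

where the last equality follows from $\mu_k \uparrow \infty$. Therefore, we have that $c_i(x^*) = 0$ for all $i \in \mathcal{E}$, so that $x^*$ is feasible. Moreover, by taking the limit as $k \to \infty$ for $k \in \mathcal{K}$ in (17.8), we have by nonnegativity of $\mu_k$ and of each $c_i(x_k)^2$ that

\[ f(x^*) \le f(x^*) + \lim_{k \in \mathcal{K}} \frac{\mu_k}{2} \sum_{i \in \mathcal{E}} c_i^2(x_k) \le f(\bar{x}). \]

Since $x^*$ is a feasible point whose objective value is no larger than that of the global solution $\bar{x}$, we conclude that $x^*$, too, is a global solution, as claimed.  □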

Since this result requires us to find the global minimizer for each subproblem, this desirable property of convergence to the global solution of (17.1) cannot be attained in general. The next result concerns convergence properties of the sequence $\{x_k\}$ when we allow inexact (but increasingly accurate) minimizations of $Q(\cdot; \mu_k)$. In contrast to Theorem 17.1, it shows that the sequence may be attracted to infeasible points, or to any KKT point (that is, a point satisfying first-order necessary conditions; see (12.34)), rather than to a minimizer. It also shows that the quantities $-\mu_k c_i(x_k)$ may be used as estimates of the Lagrange multipliers $\lambda_i^*$ in certain circumstances. This observation is important for the analysis of augmented Lagrangian methods in Section 17.3.

To establish the result we will make the (optimistic) assumption that the stop test $\|\nabla_x Q(x; \mu_k)\| \le \tau_k$ is satisfied for all $k$.

Theorem 17.2.

Suppose that the tolerances and penalty parameters in Framework 17.1 satisfy $\tau_k \to 0$ and $\mu_k \uparrow \infty$. Then if a limit point $x^*$ of the sequence $\{x_k\}$ is infeasible, it is a stationary point of the function $\|c(x)\|^2$. On the other hand, if a limit point $x^*$ is feasible and the constraint gradients $\nabla c_i(x^*)$ are linearly independent, then $x^*$ is a KKT point for the problem (17.1). For such points, we have for any infinite subsequence $\mathcal{K}$ such that $\lim_{k \in \mathcal{K}} x_k = x^*$ that

\[ \lim_{k \in \mathcal{K}} -\mu_k c_i(x_k) = \lambda_i^*, \quad \text{for all } i \in \mathcal{E}, \tag{17.10} \]

where $\lambda^*$ is the multiplier vector that satisfies the KKT conditions (12.34) for the equality-constrained problem (17.1).
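The limit (17.10) can be observed numerically on problem (17.3), assuming $f(x) = x_1 + x_2$ and $c(x) = x_1^2 + x_2^2 - 2$, for which the multiplier at $x^* = (-1, -1)^T$ is $\lambda^* = -0.5$. The sketch uses the fact that the minimizer of $Q(\cdot;\mu)$ for this problem lies on the line $x_1 = x_2 = t$, where $t$ is the root of $4\mu t^3 - 4\mu t + 1 = 0$ near $-1$. The function name and the starting guess $t = -1.2$ are illustrative assumptions.

```python
def diagonal_minimizer(mu, t=-1.2, iters=60):
    """Newton's method for the root of g(t) = 4*mu*t^3 - 4*mu*t + 1 near t = -1."""
    for _ in range(iters):
        g = 4.0 * mu * t ** 3 - 4.0 * mu * t + 1.0
        dg = 12.0 * mu * t * t - 4.0 * mu
        t -= g / dg
    return t

estimates = []
for mu in (1.0, 10.0, 100.0, 1000.0):
    t = diagonal_minimizer(mu)
    c = 2.0 * t * t - 2.0            # constraint value at x = (t, t)
    estimates.append(-mu * c)        # multiplier estimate from (17.10)
    print(f"mu = {mu:7.1f}   c(x_k) = {c:.6f}   -mu*c(x_k) = {-mu * c:.6f}")
```

The constraint violation shrinks roughly like $1/\mu$, while the product $-\mu c(x_k)$ settles near $-0.5$.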

Proof. By differentiating $Q(x; \mu_k)$ in (17.2), we obtain

\[ \nabla_x Q(x_k; \mu_k) = \nabla f(x_k) + \sum_{i \in \mathcal{E}} \mu_k c_i(x_k) \nabla c_i(x_k), \tag{17.11} \]

so from the termination criterion for Framework 17.1, we have that

\[ \Big\| \nabla f(x_k) + \sum_{i \in \mathcal{E}} \mu_k c_i(x_k) \nabla c_i(x_k) \Big\| \le \tau_k. \tag{17.12} \]

By rearranging this expression (and in particular using the inequality $\|a\| - \|b\| \le \|a + b\|$), we obtain

\[ \Big\| \sum_{i \in \mathcal{E}} c_i(x_k) \nabla c_i(x_k) \Big\| \le \frac{1}{\mu_k} \left[ \tau_k + \|\nabla f(x_k)\| \right]. \tag{17.13} \]

Let $x^*$ be a limit point of the sequence of iterates. Then there is a subsequence $\mathcal{K}$ such that $\lim_{k \in \mathcal{K}} x_k = x^*$. When we take limits as $k \to \infty$ for $k \in \mathcal{K}$, the bracketed term on the right-hand side approaches $\|\nabla f(x^*)\|$, so because $\mu_k \uparrow \infty$, the right-hand side approaches zero. From the corresponding limit on the left-hand side, we obtain

\[ \sum_{i \in \mathcal{E}} c_i(x^*) \nabla c_i(x^*) = 0. \tag{17.14} \]

We can have $c_i(x^*) \ne 0$ (if the constraint gradients $\nabla c_i(x^*)$ are dependent), but in this case (17.14) implies that $x^*$ is a stationary point of the function $\|c(x)\|^2$.

If, on the other hand, the constraint gradients $\nabla c_i(x^*)$ are linearly independent at a limit point $x^*$, we have from (17.14) that $c_i(x^*) = 0$ for all $i \in \mathcal{E}$, so $x^*$ is feasible. Hence, the second KKT condition (12.34b) is satisfied. We need to check the first KKT condition (12.34a) as well, and to show that the limit (17.10) holds.

By using $A(x)$ to denote the matrix of constraint gradients (also known as the Jacobian), that is,

\[ A(x)^T = [\nabla c_i(x)]_{i \in \mathcal{E}}, \tag{17.15} \]

and $\lambda^k$ to denote the vector $-\mu_k c(x_k)$, we have as in (17.12) that

\[ A(x_k)^T \lambda^k = \nabla f(x_k) - \nabla_x Q(x_k; \mu_k), \qquad \|\nabla_x Q(x_k; \mu_k)\| \le \tau_k. \tag{17.16} \]

For all $k \in \mathcal{K}$ sufficiently large, the matrix $A(x_k)$ has full row rank, so that $A(x_k) A(x_k)^T$ is nonsingular. By multiplying (17.16) by $A(x_k)$ and rearranging, we have that

\[ \lambda^k = \left[ A(x_k) A(x_k)^T \right]^{-1} A(x_k) \left[ \nabla f(x_k) - \nabla_x Q(x_k; \mu_k) \right]. \]

Hence by taking the limit as $k \in \mathcal{K}$ goes to $\infty$, we find that

\[ \lim_{k \in \mathcal{K}} \lambda^k = \lambda^* = \left[ A(x^*) A(x^*)^T \right]^{-1} A(x^*) \nabla f(x^*). \]

By taking limits in (17.12), we conclude that

\[ \nabla f(x^*) - A(x^*)^T \lambda^* = 0, \tag{17.17} \]

so that $\lambda^*$ satisfies the first KKT condition (12.34a) for (17.1). Hence, $x^*$ is a KKT point for (17.1), with unique Lagrange multiplier vector $\lambda^*$.  □

It is reassuring that, if a limit point $x^*$ is not feasible, it is at least a stationary point for the function $\|c(x)\|^2$. Newton-type algorithms can always be attracted to infeasible points of this type. (We see the same effect in Chapter 11, in our discussion of methods for nonlinear equations that use the sum-of-squares merit function $\|r(x)\|^2$.) Such methods cannot be guaranteed to find a root, and can be attracted to a stationary point or minimizer of the merit function. In the case in which the nonlinear program (17.1) is infeasible, we often observe convergence of the quadratic-penalty method to stationary points or minimizers of $\|c(x)\|^2$.

ILL  CONDITIONING  AND  REFORMULATIONS 

We now examine the nature of the ill conditioning in the Hessian $\nabla_{xx}^2 Q(x; \mu_k)$. An understanding of the properties of this matrix, and the similar Hessians that arise in other penalty and barrier methods, is essential in choosing effective algorithms for the minimization problem and for the linear algebra calculations at each iteration.

The Hessian is given by the formula

\[ \nabla_{xx}^2 Q(x; \mu_k) = \nabla^2 f(x) + \sum_{i \in \mathcal{E}} \mu_k c_i(x) \nabla^2 c_i(x) + \mu_k A(x)^T A(x), \tag{17.18} \]

where we have used the definition (17.15) of $A(x)$. When $x$ is close to the minimizer of $Q(\cdot; \mu_k)$ and the conditions of Theorem 17.2 are satisfied, we have from (17.10) that the sum of the first two terms on the right-hand side of (17.18) is approximately equal to the Hessian of the Lagrangian function defined in (12.33). To be specific, we have

\[ \nabla_{xx}^2 Q(x; \mu_k) \approx \nabla_{xx}^2 \mathcal{L}(x, \lambda^*) + \mu_k A(x)^T A(x), \tag{17.19} \]

when $x$ is close to the minimizer of $Q(\cdot; \mu_k)$. We see from this expression that $\nabla_{xx}^2 Q(x; \mu_k)$ is approximately equal to the sum of

-  a matrix whose elements are independent of $\mu_k$ (the Lagrangian term), and

-  a matrix of rank $|\mathcal{E}|$ whose nonzero eigenvalues are of order $\mu_k$ (the second term on the right-hand side of (17.19)).

The number of constraints $|\mathcal{E}|$ is usually smaller than $n$. In this case, the last term in (17.19) is singular. The overall matrix has some of its eigenvalues approaching a constant, while others are of order $\mu_k$. Since $\mu_k$ is approaching $\infty$, the increasing ill conditioning of $Q(x; \mu_k)$ is apparent.
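The growth of the condition number can be seen on problem (17.3), assuming $f(x) = x_1 + x_2$ and $c(x) = x_1^2 + x_2^2 - 2$. At the minimizer $x = (t, t)$ of $Q(\cdot;\mu)$, formula (17.18) gives the $2 \times 2$ matrix $2\mu c(x) I + 4\mu x x^T$. The sketch below evaluates its two eigenvalues for increasing $\mu$; the function names are illustrative assumptions.

```python
import math

def diagonal_minimizer(mu, t=-1.2, iters=60):
    """Newton's method for the root of 4*mu*t^3 - 4*mu*t + 1 near t = -1."""
    for _ in range(iters):
        t -= (4.0 * mu * t ** 3 - 4.0 * mu * t + 1.0) / (12.0 * mu * t * t - 4.0 * mu)
    return t

def hessian_eigenvalues(mu):
    """Eigenvalues of the Hessian (17.18) of Q at its minimizer x = (t, t)."""
    t = diagonal_minimizer(mu)
    s = mu * (2.0 * t * t - 2.0)          # mu * c(x)
    a = d = 2.0 * s + 4.0 * mu * t * t    # diagonal entries
    b = 4.0 * mu * t * t                  # off-diagonal entry
    mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius

conds = []
for mu in (1.0, 10.0, 100.0, 1000.0):
    lo, hi = hessian_eigenvalues(mu)
    conds.append(hi / lo)
    print(f"mu = {mu:7.1f}   eigenvalues = ({lo:.4f}, {hi:.1f})   cond = {hi / lo:.1f}")
```

One eigenvalue stays near 1 (the Lagrangian term), while the other grows like $8\mu$, so the condition number grows linearly with $\mu$.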

One consequence of the ill conditioning is possible inaccuracy in the calculation of the Newton step for $Q(x; \mu_k)$, which is obtained by solving the following system:

\[ \nabla_{xx}^2 Q(x; \mu_k) \, p = -\nabla_x Q(x; \mu_k). \tag{17.20} \]

In general, the poor conditioning of this system will lead to significant errors in the computed value of $p$, regardless of the computational technique used to solve (17.20). For the same reason, iterative methods can be expected to perform poorly unless accompanied by a preconditioning strategy that removes the systematic ill conditioning.

There is an alternative formulation of the equations (17.20) that avoids the ill conditioning due to the final term in (17.18). By introducing a new variable vector $\zeta$ defined by $\zeta = \mu_k A(x) p$, we see that the vector $p$ that solves (17.20) also satisfies the following system:

\[ \begin{bmatrix} \nabla^2 f(x) + \sum_{i \in \mathcal{E}} \mu_k c_i(x) \nabla^2 c_i(x) & A(x)^T \\ A(x) & -(1/\mu_k) I \end{bmatrix} \begin{bmatrix} p \\ \zeta \end{bmatrix} = \begin{bmatrix} -\nabla_x Q(x; \mu_k) \\ 0 \end{bmatrix}. \tag{17.21} \]

When $x$ is not too far from the solution $x^*$, the coefficient matrix in this system does not have large singular values (of order $\mu_k$), so the system (17.21) can be viewed as a well conditioned reformulation of (17.20). We note, however, that neither system may yield a good search direction $p$ because the coefficients $\mu_k c_i(x)$ in the summation term of the upper left block of (17.21) may be poor approximations to the Lagrange multipliers $-\lambda_i^*$, even when $x$ is quite close to the minimizer $x_k$ of $Q(x; \mu_k)$. This fact may cause the quadratic model on which $p$ is based to be an inadequate model of $Q(\cdot; \mu_k)$, so the Newton step may be intrinsically an unsuitable search direction. We discussed possible remedies for this difficulty above, in our comments following Framework 17.1.
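The equivalence between the two systems follows from block elimination. The steps below are a short verification added for illustration, using only (17.18), (17.20), and the definition $\zeta = \mu_k A(x) p$.

```latex
% Second block row of (17.21):
A(x)\,p - \frac{1}{\mu_k}\,\zeta = 0
\quad\Longrightarrow\quad
\zeta = \mu_k A(x)\,p .

% First block row of (17.21), with \zeta eliminated:
\Big[\nabla^2 f(x) + \sum_{i\in\mathcal{E}} \mu_k c_i(x)\,\nabla^2 c_i(x)\Big]\,p
  + \mu_k A(x)^T A(x)\,p \;=\; -\nabla_x Q(x;\mu_k).
```

By (17.18), the matrix multiplying $p$ in the last line is $\nabla_{xx}^2 Q(x;\mu_k)$, so the last line is exactly (17.20).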

To compute the step via (17.21) involves the solution of a linear system of dimension $n + |\mathcal{E}|$ rather than the system of dimension $n$ given by (17.20). A similar system must be solved to calculate the sequential quadratic programming (SQP) step (18.6), which is derived in Chapter 18. In fact, when $\mu_k$ is large, (17.21) can be viewed as a regularization of the SQP step (18.6) in which the term $-(1/\mu_k) I$ helps to ensure that the iteration matrix is nonsingular even when the Jacobian $A(x)$ is rank deficient. On the other hand, when $\mu_k$ is small, (17.21) shows that the step computed by the quadratic penalty method does not closely satisfy the linearization of the constraints. This situation is undesirable because the steps may not make significant progress toward the feasible region, resulting in inefficient global behavior. Moreover, if $\{\mu_k\}$ does not approach $\infty$ rapidly enough, we lose the possibility of a superlinear rate that occurs when the linearization is exact; see Chapter 18.

To conclude, the formulation (17.21) allows us to view the quadratic penalty method either as the application of unconstrained minimization to the penalty function $Q(\cdot; \mu_k)$ or as a variation on the SQP methods discussed in Chapter 18.


17.2  NONSMOOTH  PENALTY  FUNCTIONS 


Some penalty functions are exact, which means that, for certain choices of their penalty parameters, a single minimization with respect to $x$ can yield the exact solution of the nonlinear programming problem. This property is desirable because it makes the performance of penalty methods less dependent on the strategy for updating the penalty parameter. The quadratic penalty function of Section 17.1 is not exact because its minimizer is generally not the same as the solution of the nonlinear program for any positive value of $\mu$. In this section we discuss nonsmooth exact penalty functions, which have proved to be useful in a number of practical contexts.

A popular nonsmooth penalty function for the general nonlinear programming problem (17.6) is the $\ell_1$ penalty function defined by

\[ \phi_1(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E}} |c_i(x)| + \mu \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{17.22} \]

where we use again the notation $[y]^- = \max\{0, -y\}$. Its name derives from the fact that the penalty term is $\mu$ times the $\ell_1$ norm of the constraint violation. Note that $\phi_1(x; \mu)$ is not differentiable at some $x$, because of the presence of the absolute value and $[\cdot]^-$ functions.

The following result establishes the exactness of the $\ell_1$ penalty function. For a proof see [165, Theorem 4.4].

Theorem 17.3.

Suppose that $x^*$ is a strict local solution of the nonlinear programming problem (17.6) at which the first-order necessary conditions of Theorem 12.1 are satisfied, with Lagrange multipliers $\lambda_i^*$, $i \in \mathcal{E} \cup \mathcal{I}$. Then $x^*$ is a local minimizer of $\phi_1(x; \mu)$ for all $\mu > \mu^*$, where

\[ \mu^* = \|\lambda^*\|_\infty = \max_{i \in \mathcal{E} \cup \mathcal{I}} |\lambda_i^*|. \tag{17.23} \]

If, in addition, the second-order sufficient conditions of Theorem 12.6 hold and $\mu > \mu^*$, then $x^*$ is a strict local minimizer of $\phi_1(x; \mu)$.

Loosely speaking, at a solution of the nonlinear program $x^*$, any move into the infeasible region is penalized sharply enough that it produces an increase in the penalty function to a value greater than $\phi_1(x^*; \mu) = f(x^*)$, thereby forcing the minimizer of $\phi_1(\cdot; \mu)$ to lie at $x^*$.

□  Example 17.2

Consider the following problem in one variable:

\[ \min \; x \quad \text{subject to } x \ge 1, \tag{17.24} \]

whose solution is $x^* = 1$. We have that

\[ \phi_1(x; \mu) = x + \mu [x - 1]^- = \begin{cases} (1 - \mu)x + \mu & \text{if } x \le 1, \\ x & \text{if } x \ge 1. \end{cases} \tag{17.25} \]

As can be seen in Figure 17.3, the penalty function has a minimizer at $x^* = 1$ when $\mu > 1$, but is a monotone increasing function when $\mu < 1$.
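The two regimes of Example 17.2 can be checked by evaluating (17.25) on a grid. The interval $[-5, 5]$, the grid size, and the helper names are arbitrary illustrative choices; a grid search is used here only because the function is one-dimensional.

```python
def phi1(x, mu):
    """l1 penalty function (17.25) for: min x subject to x >= 1."""
    return x + mu * max(1.0 - x, 0.0)

def grid_minimizer(mu, lo=-5.0, hi=5.0, n=1001):
    """Grid point in [lo, hi] with the smallest penalty function value."""
    grid = [lo + (hi - lo) * j / (n - 1) for j in range(n)]
    return min(grid, key=lambda x: phi1(x, mu))

print(grid_minimizer(2.0))   # mu > 1: the minimizer sits at x* = 1
print(grid_minimizer(0.5))   # mu < 1: increasing function, minimizer at the left end
```

For $\mu < 1$ the reported point is simply the left end of the search interval, which reflects the fact that $\phi_1(\cdot;\mu)$ is then unbounded below.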


Figure 17.3  Penalty function for problem (17.24) with $\mu > 1$ (left) and $\mu < 1$ (right).

Since penalty methods work by minimizing the penalty function directly, we need to characterize stationary points of $\phi_1$. Even though $\phi_1$ is not differentiable, it has a directional derivative $D(\phi_1(x; \mu); p)$ along any direction; see (A.51) and the example following this definition.

Definition 17.1.

A point $\hat{x} \in \mathbb{R}^n$ is a stationary point for the penalty function $\phi_1(x; \mu)$ if

\[ D(\phi_1(\hat{x}; \mu); p) \ge 0, \tag{17.26} \]

for all $p \in \mathbb{R}^n$. Similarly, $\hat{x}$ is a stationary point of the measure of infeasibility

\[ h(x) = \sum_{i \in \mathcal{E}} |c_i(x)| + \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{17.27} \]

if $D(h(\hat{x}); p) \ge 0$ for all $p \in \mathbb{R}^n$. If a point is infeasible for (17.6) but stationary with respect to the infeasibility measure $h$, we say that it is an infeasible stationary point.


For the function in Example 17.2, we have for $x^* = 1$ that

\[ D(\phi_1(x^*; \mu); p) = \begin{cases} p & \text{if } p \ge 0, \\ (1 - \mu) p & \text{if } p < 0; \end{cases} \]

it follows that when $\mu > 1$, we have $D(\phi_1(x^*; \mu); p) \ge 0$ for all $p \in \mathbb{R}$.
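The stationarity test of Definition 17.1 can be tried on this one-dimensional example. The sketch below evaluates the directional derivative at $x^* = 1$ by a one-sided difference quotient, which is exact at this point because the penalty function is linear on each side of $x = 1$. The step $t = 0.5$ and the function names are illustrative assumptions.

```python
def phi1(x, mu):
    """l1 penalty function for: min x subject to x >= 1."""
    return x + mu * max(1.0 - x, 0.0)

def directional_derivative(x, p, mu, t=0.5):
    """One-sided difference quotient (phi1(x + t*p) - phi1(x)) / t with t > 0."""
    return (phi1(x + t * p, mu) - phi1(x, mu)) / t

def is_stationary(x, mu):
    # In one variable the directions p = +1 and p = -1 suffice, because the
    # directional derivative is positively homogeneous in p.
    return all(directional_derivative(x, p, mu) >= 0.0 for p in (1.0, -1.0))

print(is_stationary(1.0, mu=2.0))   # mu > 1
print(is_stationary(1.0, mu=0.5))   # mu < 1
```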

The following result complements Theorem 17.3 by showing that stationary points of $\phi_1(x; \mu)$ correspond to KKT points of the constrained optimization problem (17.6) under certain assumptions.


Theorem 17.4.

Suppose that $\hat{x}$ is a stationary point of the penalty function $\phi_1(x; \mu)$ for all $\mu$ greater than a certain threshold $\hat{\mu} > 0$. Then, if $\hat{x}$ is feasible for the nonlinear program (17.6), it satisfies the KKT conditions (12.34) for (17.6). If $\hat{x}$ is not feasible for (17.6), it is an infeasible stationary point.

Proof. Suppose first that $\hat{x}$ is feasible. We have from (A.51) and the definition (17.22) of $\phi_1$ that

\[ D(\phi_1(\hat{x}; \mu); p) = \nabla f(\hat{x})^T p + \mu \sum_{i \in \mathcal{E}} |\nabla c_i(\hat{x})^T p| + \mu \sum_{i \in \mathcal{I} \cap \mathcal{A}(\hat{x})} [\nabla c_i(\hat{x})^T p]^-, \tag{17.28} \]

where the active set $\mathcal{A}(\hat{x})$ is defined in Definition 12.1. (We leave verification of (17.28) as an exercise.) Consider any direction $p$ in the linearized feasible direction set $\mathcal{F}(\hat{x})$ of Definition 12.3. By the properties of $\mathcal{F}(\hat{x})$, we have

\[ \sum_{i \in \mathcal{E}} |\nabla c_i(\hat{x})^T p| + \sum_{i \in \mathcal{I} \cap \mathcal{A}(\hat{x})} [\nabla c_i(\hat{x})^T p]^- = 0, \]

so that by the stationarity assumption on $\phi_1(\hat{x}; \mu)$, we have

\[ 0 \le D(\phi_1(\hat{x}; \mu); p) = \nabla f(\hat{x})^T p, \quad \text{for all } p \in \mathcal{F}(\hat{x}). \]

We can now apply Farkas’ Lemma (Lemma 12.4) to deduce that

\[ \nabla f(\hat{x}) = \sum_{i \in \mathcal{A}(\hat{x})} \hat{\lambda}_i \nabla c_i(\hat{x}), \]

for some coefficients $\hat{\lambda}_i$ with $\hat{\lambda}_i \ge 0$ for all $i \in \mathcal{I} \cap \mathcal{A}(\hat{x})$. As we noted earlier (see Theorem 12.1 and (12.35)), this expression implies that the KKT conditions (12.34) hold, as claimed.

We leave the second part of the proof (concerning infeasible $\hat{x}$) as an exercise.  □

Figure 17.4  Contours of $\phi_1(x; \mu)$ from (17.3) for $\mu = 2$, contour spacing 0.5.


□  Example 17.3

Consider again problem (17.3), for which the $\ell_1$ penalty function is

\[ \phi_1(x; \mu) = x_1 + x_2 + \mu \, |x_1^2 + x_2^2 - 2|. \tag{17.29} \]

Figure 17.4 plots the function $\phi_1(x; 2)$, whose minimizer is the solution $x^* = (-1, -1)^T$ of (17.3). In fact, following Theorem 17.3, we find that for all $\mu > |\lambda^*| = 0.5$, the minimizer of $\phi_1(x; \mu)$ coincides with $x^*$. The sharp corners on the contours indicate nonsmoothness along the boundary of the circle defined by $x_1^2 + x_2^2 = 2$.

These results provide the motivation for an algorithmic framework based on the $\ell_1$ penalty function, which we now present.


Framework 17.2 (Classical $\ell_1$ Penalty Method).

Given $\mu_0 > 0$, tolerance $\tau > 0$, starting point $x_0^s$;
for $k = 0, 1, 2, \ldots$

    Find an approximate minimizer $x_k$ of $\phi_1(x; \mu_k)$, starting at $x_k^s$;
    if $h(x_k) \le \tau$

        stop with approximate solution $x_k$;

    end (if)

    Choose new penalty parameter $\mu_{k+1} > \mu_k$;

    Choose new starting point $x_{k+1}^s$;

end (for)
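The framework can be exercised on the one-variable problem (17.24). The sketch below makes several simplifying assumptions that are not part of the framework: the inner minimization is a ternary search over a fixed interval $[-10, 10]$ (valid here because the penalty function is convex in one variable), no starting point is needed, and the penalty parameter is multiplied by 10 in each cycle. All names are illustrative.

```python
def phi1(x, mu):
    """l1 penalty function for: min x subject to x >= 1."""
    return x + mu * max(1.0 - x, 0.0)

def h(x):
    """Infeasibility measure (17.27) for the single constraint x - 1 >= 0."""
    return max(1.0 - x, 0.0)

def ternary_min(fun, lo, hi, iters=200):
    """Minimize a convex function of one variable on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2          # a minimizer lies in [lo, m2]
        else:
            lo = m1          # a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)

def l1_penalty_method(mu=0.25, tau=1e-8, lo=-10.0, hi=10.0, max_outer=20):
    history = []
    for _ in range(max_outer):
        x = ternary_min(lambda z: phi1(z, mu), lo, hi)
        history.append((mu, x))
        if h(x) <= tau:      # feasible to within tau: stop
            break
        mu *= 10.0           # mu_{k+1} > mu_k
    return x, mu, history

x, mu, history = l1_penalty_method()
print(history)
```

With the initial value $\mu_0 = 0.25$ the first cycle drifts to the left end of the interval; after one increase the penalty parameter exceeds the threshold 1 and the minimizer is $x^* = 1$.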


The minimization of $\phi_1(x; \mu_k)$ is made difficult by the nonsmoothness of the function. Nevertheless, as we discuss below, it is well understood how to compute minimization steps using a smooth model of $\phi_1(x; \mu_k)$, in a way that resembles SQP methods.

The simplest scheme for updating the penalty parameter $\mu_k$ is to increase it by a constant multiple (say 5 or 10), if the current value produces a minimizer that is not feasible to within the tolerance $\tau$. This scheme sometimes works well in practice, but can also be inefficient. If the initial penalty parameter $\mu_0$ is too small, many cycles of Framework 17.2 may be needed to determine an appropriate value. In addition, the iterates may move away from the solution $x^*$ in these initial cycles, in which case the minimization of $\phi_1(x; \mu_k)$ should be terminated early and $x_k^s$ should possibly be reset to a previous iterate. If, on the other hand, $\mu_k$ is excessively large, the penalty function will be difficult to minimize, possibly requiring a large number of iterations. We return to the issue of selecting the penalty parameter below.


A PRACTICAL $\ell_1$ PENALTY METHOD

As noted already, $\phi_1(x; \mu)$ is nonsmooth: its gradient is not defined at any $x$ for which $c_i(x) = 0$ for some $i \in \mathcal{E} \cup \mathcal{I}$. Rather than using techniques for nondifferentiable optimization, such as bundle methods [170], we prefer techniques that take account of the special nature of the nondifferentiabilities in this function. As in the algorithms for unconstrained optimization discussed in the first part of this book, we obtain a step toward the minimizer of $\phi_1(x; \mu)$ by forming a simplified model of this function and seeking the minimizer of this model. Here, the model can be defined by linearizing the constraints $c_i$ and replacing the nonlinear programming objective $f$ by a quadratic function, as follows:

\[ q(p; \mu) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T W p + \mu \sum_{i \in \mathcal{E}} |c_i(x) + \nabla c_i(x)^T p| + \mu \sum_{i \in \mathcal{I}} [c_i(x) + \nabla c_i(x)^T p]^-, \tag{17.30} \]

where $W$ is a symmetric matrix which usually contains second derivative information about $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$. The model $q(p; \mu)$ is not smooth, but we can formulate the problem of minimizing $q$ as a smooth quadratic programming problem by introducing artificial variables $r_i$, $s_i$, and $t_i$, as follows:

\[ \begin{aligned} \min_{p, r, s, t} \quad & f(x) + \tfrac{1}{2} p^T W p + \nabla f(x)^T p + \mu \sum_{i \in \mathcal{E}} (r_i + s_i) + \mu \sum_{i \in \mathcal{I}} t_i \\ \text{subject to} \quad & \nabla c_i(x)^T p + c_i(x) = r_i - s_i, \quad i \in \mathcal{E}, \\ & \nabla c_i(x)^T p + c_i(x) \ge -t_i, \quad i \in \mathcal{I}, \\ & r, s, t \ge 0. \end{aligned} \tag{17.31} \]

This subproblem can be solved with a standard quadratic programming solver. Even after addition of a “box-shaped” trust region constraint of the form $\|p\|_\infty \le \Delta$, it remains a quadratic program. This approach to minimizing $\phi_1$ is closely related to sequential quadratic programming (SQP) and will be discussed further in Chapter 18.
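To see why the smooth problem (17.31) reproduces the nonsmooth terms of (17.30), fix $p$ and write $v_i = c_i(x) + \nabla c_i(x)^T p$. The smallest slacks allowed by the constraints are then $r_i = \max(v_i, 0)$, $s_i = \max(-v_i, 0)$ and $t_i = \max(-v_i, 0)$. The sketch below checks on sample values that these slacks give the same penalty terms as the model; the function names and the numbers are illustrative assumptions.

```python
def optimal_slacks(v_eq, v_ineq):
    """Smallest r, s, t feasible for (17.31) when the linearized constraint
    values v_i = c_i(x) + grad c_i(x)^T p are held fixed."""
    r = [max(v, 0.0) for v in v_eq]      # positive part, so that r - s = v
    s = [max(-v, 0.0) for v in v_eq]     # negative part
    t = [max(-v, 0.0) for v in v_ineq]   # t >= -v and t >= 0
    return r, s, t

def nonsmooth_terms(v_eq, v_ineq):
    """Sum of |v_i| over equalities plus sum of [v_i]^- over inequalities."""
    return sum(abs(v) for v in v_eq) + sum(max(-v, 0.0) for v in v_ineq)

v_eq, v_ineq = [1.5, -2.0, 0.0], [0.5, -3.0]
r, s, t = optimal_slacks(v_eq, v_ineq)
print(sum(r) + sum(s) + sum(t), nonsmooth_terms(v_eq, v_ineq))
```

Both printed numbers coincide, so multiplying by $\mu$ and adding the quadratic part gives the same objective value in (17.30) and (17.31).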

The strategy for choosing and updating the penalty parameter $\mu_k$ is crucial to the practical success of the iteration. We mentioned that a simple (but not always effective) approach is to choose an initial value and increase it repeatedly until feasibility is attained. In some variants of the approach, the penalty parameter is chosen at every iteration so that $\mu_k > \|\lambda^k\|_\infty$, where $\lambda^k$ is an estimate of the Lagrange multipliers computed at $x_k$. We base this strategy on Theorem 17.3, which suggests that in a neighborhood of a solution $x^*$, a good choice would be to set $\mu_k$ modestly larger than $\|\lambda^*\|_\infty$. This strategy is not always successful, as the multiplier estimates may be inaccurate and may in any case not provide an appropriate value of $\mu_k$ far from the solution.

The difficulties of choosing appropriate values of $\mu_k$ caused nonsmooth penalty methods to fall out of favor during the 1990s and stimulated the development of filter methods, which do not require the choice of a penalty parameter (see Section 15.4). In recent years, however, there has been a resurgence of interest in penalty methods, in part because of their ability to handle degenerate problems. New approaches for updating the penalty parameter appear to have largely overcome the difficulties associated with choosing $\mu_k$, at least for some particular implementations (see Algorithm 18.5).

Careful consideration should also be given to the choice of starting point $x_{k+1}^s$ for the minimization of $\phi_1(x; \mu_{k+1})$. If the penalty parameter $\mu_k$ for the present cycle is appropriate, in the sense that the algorithm made progress toward feasibility, then we can set $x_{k+1}^s$ to be the minimizer $x_k$ of $\phi_1(x; \mu_k)$ obtained on this cycle. Otherwise, we may want to restore the initial point from an earlier cycle.

A  GENERAL  CLASS  OF  NONSMOOTH  PENALTY  METHODS 

Exact nonsmooth penalty functions can be defined in terms of norms other than the $\ell_1$ norm. We can write

\[ \phi(x; \mu) = f(x) + \mu \|c_{\mathcal{E}}(x)\| + \mu \|[c_{\mathcal{I}}(x)]^-\|, \tag{17.32} \]

where $\|\cdot\|$ is any vector norm, and all the equality and inequality constraints have been grouped in the vector functions $c_{\mathcal{E}}$ and $c_{\mathcal{I}}$, respectively. Framework 17.2 applies to any of these penalty functions; we simply redefine the measure of infeasibility as $h(x) = \|c_{\mathcal{E}}(x)\| + \|[c_{\mathcal{I}}(x)]^-\|$. The most common norms used in practice are the $\ell_1$, $\ell_\infty$, and $\ell_2$ (not squared). It is easy to find a reformulation similar to (17.31) for the $\ell_\infty$ norm.

The theoretical properties described for the $\ell_1$ function extend to the general class (17.32). In Theorem 17.3, we replace the threshold (17.23) by

\[ \mu^* = \|\lambda^*\|_D, \tag{17.33} \]

where $\|\cdot\|_D$ is the dual norm of $\|\cdot\|$, defined in (A.6). Theorem 17.4 applies without modification.
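For the three norms named above the dual norm has a closed form: the dual of $\ell_1$ is $\ell_\infty$, the dual of $\ell_\infty$ is $\ell_1$, and $\ell_2$ is its own dual. A minimal sketch of the threshold (17.33) under these standard facts follows; the function name and the string labels for the norms are illustrative assumptions.

```python
import math

def exactness_threshold(lam, norm):
    """Threshold mu* = ||lam||_D of (17.33), where `norm` names the norm used
    in the penalty function (17.32) and D denotes its dual norm."""
    if norm == "l1":      # dual norm is l-infinity, which recovers (17.23)
        return max(abs(v) for v in lam)
    if norm == "linf":    # dual norm is l1
        return sum(abs(v) for v in lam)
    if norm == "l2":      # the l2 norm is self-dual
        return math.sqrt(sum(v * v for v in lam))
    raise ValueError(f"unsupported norm: {norm}")

lam = [3.0, -4.0]         # sample multiplier vector, chosen for illustration
for name in ("l1", "linf", "l2"):
    print(name, exactness_threshold(lam, name))
```

For a single constraint all three thresholds agree; for instance, the multiplier $\lambda^* = -0.5$ of Example 17.3 gives the threshold 0.5 in every case.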

We show now that penalty functions of the type considered so far in this chapter must be nonsmooth to be exact. For simplicity, we restrict our attention to the case when there is a single equality constraint $c_1(x) = 0$, and consider a penalty function of the form

\[ \phi(x; \mu) = f(x) + \mu h(c_1(x)), \tag{17.34} \]

where $h : \mathbb{R} \to \mathbb{R}$ is a function satisfying the properties $h(y) \ge 0$ for all $y \in \mathbb{R}$ and $h(0) = 0$. Suppose for contradiction that $h$ is continuously differentiable. Since $h$ has a minimizer at zero, we have from Theorem 2.2 that $\nabla h(0) = 0$. If $x^*$ is a local solution of the problem (17.6), we have $c_1(x^*) = 0$ and therefore $\nabla h(c_1(x^*)) = 0$. If $x^*$ is a local minimizer of $\phi(x; \mu)$, we therefore have

\[ 0 = \nabla \phi(x^*; \mu) = \nabla f(x^*) + \mu \nabla c_1(x^*) \nabla h(c_1(x^*)) = \nabla f(x^*). \]

However, it is not generally true that the gradient of $f$ vanishes at the solution of a constrained optimization problem, so our original assumption that $h$ is continuously differentiable must be incorrect, and $\phi(\cdot; \mu)$ cannot be smooth.

Nonsmooth penalty functions are also used as merit functions in methods that compute steps by some other mechanism. For further details see the general discussion of Section 15.4 and the concrete implementations given in Chapters 18 and 19.


17.3  AUGMENTED LAGRANGIAN METHOD: EQUALITY CONSTRAINTS


We  now  discuss  an  approach  known  as  the  method  of  multipliers  or  the  augmented  Lagrangian 
method.  This  algorithm  is  related  to  the  quadratic  penalty  algorithm  of  Section  17.1,  but 
it  reduces  the  possibility  of  ill  conditioning  by  introducing  explicit  Lagrange  multiplier 
estimates  into  the  function  to  be  minimized,  which  is  known  as  the  augmented  Lagrangian 
function.  In  contrast  to  the  penalty  functions  discussed  in  Section  17.2,  the  augmented 
Lagrangian  function  largely  preserves  smoothness,  and  implementations  can  be  constructed 
from  standard  software  for  unconstrained  or  bound-constrained  optimization. 

In this section we use superscripts (usually $k$ and $k+1$) on the Lagrange multiplier
estimates to denote iteration index, and subscripts (usually $i$) to denote the component
indices of the vector $\lambda$. For all other variables we use subscripts for the iteration index, as
usual.


MOTIVATION  AND  ALGORITHMIC  FRAMEWORK 

We consider first the equality-constrained problem (17.1). The quadratic penalty
function $Q(x; \mu)$ defined by (17.2) penalizes constraint violations by squaring the
infeasibilities and scaling them by $\mu/2$. As we see from Theorem 17.2, however, the approximate
minimizers $x_k$ of $Q(x; \mu_k)$ do not quite satisfy the feasibility conditions $c_i(x) = 0$, $i \in \mathcal{E}$.
Instead, they are perturbed (see (17.10)) so that

\[ c_i(x_k) \approx -\lambda_i^*/\mu_k, \quad \text{for all } i \in \mathcal{E}. \tag{17.35} \]

To be sure, we have $c_i(x_k) \to 0$ as $\mu_k \uparrow \infty$, but one may ask whether we can alter the
function $Q(x; \mu_k)$ to avoid this systematic perturbation, that is, to make the approximate
minimizers more nearly satisfy the equality constraints $c_i(x) = 0$, even for moderate values
of $\mu_k$.

The augmented Lagrangian function $\mathcal{L}_A(x, \lambda; \mu)$ achieves this goal by including an
explicit estimate of the Lagrange multipliers $\lambda$, based on the estimate (17.35), in the objective.
From the definition

\[ \mathcal{L}_A(x, \lambda; \mu) = f(x) - \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i^2(x), \tag{17.36} \]

we see that the augmented Lagrangian differs from the (standard) Lagrangian (12.33) for
(17.1) by the presence of the squared terms, while it differs from the quadratic penalty
function (17.2) in the presence of the summation term involving $\lambda$. In this sense, it is a
combination of the Lagrangian function and the quadratic penalty function.

We now design an algorithm that fixes the penalty parameter $\mu$ to some value $\mu_k > 0$
at its $k$th iteration (as in Frameworks 17.1 and 17.2), fixes $\lambda$ at the current estimate $\lambda^k$, and
performs minimization with respect to $x$. Using $x_k$ to denote the approximate minimizer
of $\mathcal{L}_A(x, \lambda^k; \mu_k)$, we have by the optimality conditions for unconstrained minimization
(Theorem 2.2) that

\[ 0 \approx \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k) = \nabla f(x_k) - \sum_{i \in \mathcal{E}} [\lambda_i^k - \mu_k c_i(x_k)] \nabla c_i(x_k). \tag{17.37} \]

By comparing with the optimality condition (17.17) for (17.1), we can deduce that

\[ \lambda_i^* \approx \lambda_i^k - \mu_k c_i(x_k), \quad \text{for all } i \in \mathcal{E}. \tag{17.38} \]

By rearranging this expression, we have that

\[ c_i(x_k) \approx -\frac{1}{\mu_k} (\lambda_i^* - \lambda_i^k), \quad \text{for all } i \in \mathcal{E}, \]

so we conclude that if $\lambda^k$ is close to the optimal multiplier vector $\lambda^*$, the infeasibility in $x_k$ will
be much smaller than $1/\mu_k$, rather than being proportional to $1/\mu_k$ as in (17.35). The
relation (17.38) immediately suggests a formula for improving our current estimate $\lambda^k$ of the
Lagrange multiplier vector, using the approximate minimizer $x_k$ just calculated: We can set

\[ \lambda_i^{k+1} = \lambda_i^k - \mu_k c_i(x_k), \quad \text{for all } i \in \mathcal{E}. \tag{17.39} \]


This  discussion  motivates  the  following  algorithmic  framework. 

Framework 17.3 (Augmented Lagrangian Method-Equality Constraints).

Given $\mu_0 > 0$, tolerance $\tau_0 > 0$, starting points $x_0^s$ and $\lambda^0$;
for $k = 0, 1, 2, \ldots$
    Find an approximate minimizer $x_k$ of $\mathcal{L}_A(\cdot, \lambda^k; \mu_k)$, starting at $x_k^s$,
        and terminating when $\|\nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k)\| \le \tau_k$;
    if a convergence test for (17.1) is satisfied
        stop with approximate solution $x_k$;
    end (if)
    Update Lagrange multipliers using (17.39) to obtain $\lambda^{k+1}$;
    Choose new penalty parameter $\mu_{k+1} \ge \mu_k$;
    Set starting point for the next iteration to $x_{k+1}^s = x_k$;
    Select tolerance $\tau_{k+1}$;
end (for)

Figure 17.5 Contours of $\mathcal{L}_A(x, \lambda; \mu)$ from (17.40) for $\lambda = -0.4$ and $\mu = 1$, contour
spacing 0.5.

We show below that convergence of this method can be assured without increasing
$\mu$ indefinitely. Ill conditioning is therefore less of a problem than in Framework 17.1, so
the choice of starting point $x_{k+1}^s$ in Framework 17.3 is less critical. (In Framework 17.3 we
simply start the search at iteration $k+1$ from the previous approximate minimizer $x_k$.) The
tolerance $\tau_k$ could be chosen to depend on the infeasibility $\sum_{i \in \mathcal{E}} |c_i(x_k)|$, and the penalty
parameter $\mu$ may be increased if the reduction in this infeasibility measure is insufficient at
the present iteration.
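
The following Python sketch shows one way Framework 17.3 might be instantiated on problem
(17.3), minimizing $x_1 + x_2$ subject to $x_1^2 + x_2^2 - 2 = 0$. It is an illustration under stated
assumptions rather than a reference implementation: the inner solver (steepest descent with
Armijo backtracking), the fixed penalty parameter mu = 10, the tolerance schedule, the form of
the convergence test, and all function names are choices made here and are not prescribed by
the framework.

```python
import math

# Problem (17.3): minimize x1 + x2 subject to x1^2 + x2^2 - 2 = 0.
def f(x):
    return x[0] + x[1]

def c(x):
    return x[0] ** 2 + x[1] ** 2 - 2.0

def aug_lag(x, lam, mu):
    """Augmented Lagrangian (17.36) for a single equality constraint."""
    cx = c(x)
    return f(x) - lam * cx + 0.5 * mu * cx * cx

def grad_aug_lag(x, lam, mu):
    """Gradient as in (17.37): grad f - (lam - mu*c) * grad c."""
    w = lam - mu * c(x)
    return [1.0 - w * 2.0 * x[0], 1.0 - w * 2.0 * x[1]]

def minimize_inner(x, lam, mu, tol, max_iter=20000):
    """Steepest descent with Armijo backtracking (a simple stand-in solver)."""
    for _ in range(max_iter):
        g = grad_aug_lag(x, lam, mu)
        gg = g[0] * g[0] + g[1] * g[1]
        if math.sqrt(gg) <= tol:
            break
        fx, step = aug_lag(x, lam, mu), 1.0
        while step > 1e-16:
            trial = [x[0] - step * g[0], x[1] - step * g[1]]
            if aug_lag(trial, lam, mu) <= fx - 1e-4 * step * gg:
                x = trial
                break
            step *= 0.5
        else:
            break  # no acceptable step found; return the current point
    return x

def method_of_multipliers(x, lam, mu=10.0, tol=1e-5, max_outer=50):
    tau = 1e-2  # initial subproblem tolerance tau_0 (assumed value)
    for k in range(max_outer):
        x = minimize_inner(x, lam, mu, tau)
        g = grad_aug_lag(x, lam, mu)
        lam_next = lam - mu * c(x)  # multiplier update (17.39)
        # Convergence test (assumed form): nearly feasible and nearly stationary.
        if abs(c(x)) <= tol and math.hypot(g[0], g[1]) <= tol:
            return x, lam_next, k
        lam = lam_next
        tau = max(0.1 * tau, tol)  # tighten the tolerance; mu is kept fixed
    return x, lam, max_outer

if __name__ == "__main__":
    x, lam, k = method_of_multipliers([-2.0, -1.0], 0.0)
    print("outer iterations:", k, "x:", x, "lambda:", lam)
```

Because the penalty parameter is never increased in this sketch, any progress toward
feasibility comes from the multiplier updates alone, which is the behavior the paragraph above
describes.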


□ Example 17.4

Consider again problem (17.3), for which the augmented Lagrangian is

\[ \mathcal{L}_A(x, \lambda; \mu) = x_1 + x_2 - \lambda (x_1^2 + x_2^2 - 2) + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2. \tag{17.40} \]

The solution of (17.3) is $x^* = (-1, -1)^T$ and the optimal Lagrange multiplier is $\lambda^* = -0.5$.

Suppose that at iterate $k$ we have $\mu_k = 1$ (as in Figure 17.1), while the current
multiplier estimate is $\lambda^k = -0.4$. Figure 17.5 plots the function $\mathcal{L}_A(x, -0.4; 1)$. Note that
the spacing of the contours indicates that the conditioning of this problem is similar to that of
the quadratic penalty function $Q(x; 1)$ illustrated in Figure 17.1. However, the minimizing
value of $x_k \approx (-1.02, -1.02)^T$ is much closer to the solution $x^* = (-1, -1)^T$ than is the
minimizing value of $Q(x; 1)$, which is approximately $(-1.1, -1.1)^T$. This example shows
that the inclusion of the Lagrange multiplier term in the function $\mathcal{L}_A(x, \lambda; \mu)$ can result in
a significant improvement over the quadratic penalty method, as a way to reformulate the
constrained optimization problem (17.1).
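
The two minimizers quoted in this example can be checked numerically. The sketch below applies
plain gradient descent to (17.40), once with $\lambda = -0.4$ and once with $\lambda = 0$ (which
recovers the quadratic penalty function $Q(x; 1)$). The fixed step length 0.05, the iteration
count, the starting point $(-1, -1)^T$, and the function names are illustrative assumptions.

```python
def grad_aug_lag(x, lam, mu):
    """Gradient of (17.40): (1, 1) - (lam - mu*c(x)) * (2*x1, 2*x2)."""
    c = x[0] ** 2 + x[1] ** 2 - 2.0
    w = lam - mu * c
    return [1.0 - w * 2.0 * x[0], 1.0 - w * 2.0 * x[1]]

def minimize(lam, mu, start=(-1.0, -1.0), step=0.05, iters=2000):
    """Fixed-step gradient descent; adequate for this small, well-scaled example."""
    x = list(start)
    for _ in range(iters):
        g = grad_aug_lag(x, lam, mu)
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x

x_al = minimize(lam=-0.4, mu=1.0)  # augmented Lagrangian, multiplier estimate -0.4
x_q = minimize(lam=0.0, mu=1.0)    # quadratic penalty function Q(x; 1)

if __name__ == "__main__":
    print("augmented Lagrangian minimizer:", x_al)
    print("quadratic penalty minimizer:   ", x_q)
```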

PROPERTIES OF THE AUGMENTED LAGRANGIAN

We  now  prove  two  results  that  justify  the  use  of  the  augmented  Lagrangian  function 
and  the  method  of  multipliers  for  equality-constrained  problems. 

The first result validates the approach of Framework 17.3 by showing that when we
have knowledge of the exact Lagrange multiplier vector $\lambda^*$, the solution $x^*$ of (17.1) is a
strict minimizer of $\mathcal{L}_A(x, \lambda^*; \mu)$ for all $\mu$ sufficiently large. Although we do not know $\lambda^*$
exactly in practice, the result and its proof suggest that we can obtain a good estimate of
$x^*$ by minimizing $\mathcal{L}_A(x, \lambda; \mu)$ even when $\mu$ is not particularly large, provided that $\lambda$ is a
reasonably good estimate of $\lambda^*$.

Theorem 17.5.

Let $x^*$ be a local solution of (17.1) at which the LICQ is satisfied (that is, the gradients
$\nabla c_i(x^*)$, $i \in \mathcal{E}$, are linearly independent vectors), and the second-order sufficient conditions
specified in Theorem 12.6 are satisfied for $\lambda = \lambda^*$. Then there is a threshold value $\bar{\mu}$ such that
for all $\mu \ge \bar{\mu}$, $x^*$ is a strict local minimizer of $\mathcal{L}_A(x, \lambda^*; \mu)$.

Proof. We prove the result by showing that $x^*$ satisfies the second-order sufficient
conditions to be a strict local minimizer of $\mathcal{L}_A(x, \lambda^*; \mu)$ (see Theorem 2.4) for all $\mu$ sufficiently
large; that is,

\[ \nabla_x \mathcal{L}_A(x^*, \lambda^*; \mu) = 0, \qquad \nabla^2_{xx} \mathcal{L}_A(x^*, \lambda^*; \mu) \ \text{positive definite}. \tag{17.41} \]

Because $x^*$ is a local solution for (17.1) at which LICQ is satisfied, we can apply Theorem 12.1
to deduce that $\nabla_x \mathcal{L}(x^*, \lambda^*) = 0$ and $c_i(x^*) = 0$ for all $i \in \mathcal{E}$, so that

\[ \nabla_x \mathcal{L}_A(x^*, \lambda^*; \mu) = \nabla f(x^*) - \sum_{i \in \mathcal{E}} [\lambda_i^* - \mu c_i(x^*)] \nabla c_i(x^*)
   = \nabla f(x^*) - \sum_{i \in \mathcal{E}} \lambda_i^* \nabla c_i(x^*) = \nabla_x \mathcal{L}(x^*, \lambda^*) = 0, \]

verifying the first part of (17.41), independently of $\mu$.

For the second part of (17.41), we define $A$ to be the constraint gradient matrix in
(17.15) evaluated at $x^*$, and write

\[ \nabla^2_{xx} \mathcal{L}_A(x^*, \lambda^*; \mu) = \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) + \mu A^T A. \]

If the claim in (17.41) were not true, then for each integer $k \ge 1$, we could choose a vector
$w_k$ with $\|w_k\| = 1$ such that

\[ 0 \ge w_k^T \nabla^2_{xx} \mathcal{L}_A(x^*, \lambda^*; k) w_k = w_k^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w_k + k \|A w_k\|_2^2, \tag{17.42} \]

and therefore

\[ \|A w_k\|_2^2 \le -(1/k) \, w_k^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w_k \to 0, \quad \text{as } k \to \infty. \tag{17.43} \]

Since the vectors $\{w_k\}$ lie in a compact set (the surface of the unit sphere), they have an
accumulation point $w$. The limit (17.43) implies that $Aw = 0$. Moreover, by rearranging
(17.42), we have that

\[ w_k^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w_k \le -k \|A w_k\|_2^2 \le 0, \]

so by taking limits we have $w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w \le 0$. However, this inequality contradicts the
second-order conditions in Theorem 12.6 which, when applied to (17.1), state that we must
have $w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w > 0$ for all nonzero vectors $w$ with $Aw = 0$. Hence, the second
part of (17.41) holds for all $\mu$ sufficiently large. □
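
To see a threshold value at work, consider the following small problem (chosen here purely
for illustration; it does not appear elsewhere in the chapter): minimize $f(x) = x_1^2 - x_2^2$
subject to $c_1(x) = x_2 = 0$. The solution is $x^* = (0, 0)^T$ with $\lambda^* = 0$. LICQ holds because
$\nabla c_1(x^*) = (0, 1)^T$, and the second-order sufficient conditions hold because
$w^T \nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) w = 2 w_1^2 > 0$ for every nonzero $w = (w_1, 0)^T$ satisfying $Aw = 0$. With
$A = (0, 1)$ we have

\[ \nabla^2_{xx} \mathcal{L}_A(x^*, \lambda^*; \mu) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} + \mu \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & \mu - 2 \end{pmatrix}, \]

which is positive definite exactly when $\mu > 2$. For $\mu < 2$ the point $x^*$ is a saddle point of
$\mathcal{L}_A(\cdot, \lambda^*; \mu)$ rather than a minimizer, so in this example any threshold $\bar{\mu} > 2$ serves in
Theorem 17.5.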

The second result, given by Bertsekas [19, Proposition 4.2.3], describes the more
realistic situation of $\lambda \neq \lambda^*$. It gives conditions under which there is a minimizer of
$\mathcal{L}_A(x, \lambda; \mu)$ that lies close to $x^*$ and gives error bounds on both $x_k$ and the updated
multiplier estimate $\lambda^{k+1}$ obtained from solving the subproblem at iteration $k$.

Theorem 17.6.

Suppose that the assumptions of Theorem 17.5 are satisfied at $x^*$ and $\lambda^*$ and let $\bar{\mu}$ be
chosen as in that theorem. Then there exist positive scalars $\delta$, $\epsilon$, and $M$ such that the following
claims hold:

(a) For all $\lambda^k$ and $\mu_k$ satisfying

\[ \|\lambda^k - \lambda^*\| \le \mu_k \delta, \qquad \mu_k \ge \bar{\mu}, \tag{17.44} \]

the problem

\[ \min_x \mathcal{L}_A(x, \lambda^k; \mu_k) \quad \text{subject to} \quad \|x - x^*\| \le \epsilon \]

has a unique solution $x_k$. Moreover, we have

\[ \|x_k - x^*\| \le M \|\lambda^k - \lambda^*\| / \mu_k. \tag{17.45} \]

(b) For all $\lambda^k$ and $\mu_k$ that satisfy (17.44), we have

\[ \|\lambda^{k+1} - \lambda^*\| \le M \|\lambda^k - \lambda^*\| / \mu_k, \tag{17.46} \]

where $\lambda^{k+1}$ is given by the formula (17.39).

(c) For all $\lambda^k$ and $\mu_k$ that satisfy (17.44), the matrix $\nabla^2_{xx} \mathcal{L}_A(x_k, \lambda^k; \mu_k)$ is positive definite
and the constraint gradients $\nabla c_i(x_k)$, $i \in \mathcal{E}$, are linearly independent.

This theorem illustrates some salient properties of the augmented Lagrangian approach.
The bound (17.45) shows that $x_k$ will be close to $x^*$ if $\lambda^k$ is accurate or if the penalty
parameter $\mu_k$ is large. Hence, this approach gives us two ways of improving the accuracy
of $x_k$, whereas the quadratic penalty approach gives us only one option: increasing $\mu_k$. The
bound (17.46) states that, locally, we can ensure an improvement in the accuracy of the
multipliers by choosing a sufficiently large value of $\mu_k$. The final observation of the theorem
shows that second-order sufficient conditions for unconstrained minimization (see
Theorem 2.4) are also satisfied for the $k$th subproblem under the given conditions, so one can
expect good performance by applying standard unconstrained minimization techniques.


17.4 PRACTICAL AUGMENTED LAGRANGIAN METHODS


In this section we discuss practical augmented Lagrangian procedures, in particular,
procedures for handling inequality constraints. We discuss three approaches based, respectively,
on bound-constrained, linearly constrained, and unconstrained formulations. The first two
are the basis of the successful nonlinear programming codes LANCELOT [72] and MINOS [218].


BOUND-CONSTRAINED  FORMULATION 

Given the general nonlinear program (17.6), we can convert it to a problem with
equality constraints and bound constraints by introducing slack variables $s_i$ and replacing
the general inequalities $c_i(x) \ge 0$, $i \in \mathcal{I}$, by

\[ c_i(x) - s_i = 0, \quad s_i \ge 0, \quad \text{for all } i \in \mathcal{I}. \tag{17.47} \]

Bound constraints, $l \le x \le u$, need not be transformed. By reformulating in this way, we
can write the nonlinear program as follows:

\[ \min_x f(x) \quad \text{subject to} \quad c_i(x) = 0, \ i = 1, 2, \ldots, m, \quad l \le x \le u. \tag{17.48} \]

(The slacks $s_i$ have been incorporated into the vector $x$ and the constraint functions $c_i$
have been redefined accordingly. We have numbered the constraints consecutively with
$i = 1, 2, \ldots, m$ and in the discussion below we gather them into the vector function
$c : \mathbb{R}^n \to \mathbb{R}^m$.) Some of the components of the lower bound vector $l$ may be set to $-\infty$,
signifying that there is no lower bound on the components of $x$ in question; similarly for $u$.




The bound-constrained Lagrangian (BCL) approach incorporates only the equality
constraints from (17.48) into the augmented Lagrangian, that is,

\[ \mathcal{L}_A(x, \lambda; \mu) = f(x) - \sum_{i=1}^m \lambda_i c_i(x) + \frac{\mu}{2} \sum_{i=1}^m c_i^2(x). \tag{17.49} \]

The bound constraints are enforced explicitly in the subproblem, which has the form

\[ \min_x \mathcal{L}_A(x, \lambda; \mu) \quad \text{subject to} \quad l \le x \le u. \tag{17.50} \]

After this problem has been solved approximately, the multipliers $\lambda$ and the penalty
parameter $\mu$ are updated and the process is repeated.

An efficient technique for solving the nonlinear program with bound constraints
(17.50) (for fixed $\mu$ and $\lambda$) is the (nonlinear) gradient projection method discussed in
Section 18.6. By specializing the KKT conditions (12.34) to the problem (17.50), we find
that the first-order necessary condition for $x$ to be a solution of (17.50) is that

\[ x - P\left(x - \nabla_x \mathcal{L}_A(x, \lambda; \mu), l, u\right) = 0, \tag{17.51} \]

where $P(g, l, u)$ is the projection of the vector $g \in \mathbb{R}^n$ onto the rectangular box $[l, u]$
defined as follows:

\[ P(g, l, u)_i = \begin{cases} l_i & \text{if } g_i \le l_i, \\ g_i & \text{if } g_i \in (l_i, u_i), \\ u_i & \text{if } g_i \ge u_i, \end{cases} \qquad \text{for all } i = 1, 2, \ldots, n. \tag{17.52} \]
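
The projection (17.52) and the residual on the left-hand side of (17.51) are simple to compute
componentwise. The sketch below is one possible rendering; the function names are
hypothetical, and vectors are represented as plain Python lists.

```python
def project(g, l, u):
    """Componentwise projection (17.52) of g onto the box [l, u]."""
    return [min(max(gi, li), ui) for gi, li, ui in zip(g, l, u)]

def projected_gradient_residual(x, grad, l, u):
    """Left-hand side of (17.51): x - P(x - grad, l, u).

    `grad` stands for the gradient of the augmented Lagrangian at x.
    The residual is the zero vector when x satisfies the first-order
    necessary condition (17.51).
    """
    p = project([xi - gi for xi, gi in zip(x, grad)], l, u)
    return [xi - pi for xi, pi in zip(x, p)]

if __name__ == "__main__":
    inf = float("inf")
    # A missing lower bound is represented by -inf, as described in the text.
    print(project([-3.0, 0.5, 7.0], [-inf, 0.0, 0.0], [1.0, 1.0, 1.0]))
```

In the first usage case below, the gradient points out of the box at a variable sitting on its
lower bound, so the residual vanishes even though the gradient itself is nonzero.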


We are now ready to describe the algorithm implemented in the LANCELOT software package.


Algorithm 17.4 (Bound-Constrained Lagrangian Method).

Choose an initial point $x_0$ and initial multipliers $\lambda^0$;
Choose convergence tolerances $\eta_*$ and $\omega_*$;
Set $\mu_0 = 10$, $\omega_0 = 1/\mu_0$, and $\eta_0 = 1/\mu_0^{0.1}$;
for $k = 0, 1, 2, \ldots$
    Find an approximate solution $x_k$ of the subproblem (17.50) such that
        $\| x_k - P\left(x_k - \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k), l, u\right) \| \le \omega_k$;
    if $\|c(x_k)\| \le \eta_k$
        (* test for convergence *)
        if $\|c(x_k)\| \le \eta_*$ and $\| x_k - P\left(x_k - \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k), l, u\right) \| \le \omega_*$
            stop with approximate solution $x_k$;
        end (if)
        (* update multipliers, tighten tolerances *)
        $\lambda^{k+1} = \lambda^k - \mu_k c(x_k)$;
        $\mu_{k+1} = \mu_k$;
        $\eta_{k+1} = \eta_k / \mu_{k+1}^{0.9}$;
        $\omega_{k+1} = \omega_k / \mu_{k+1}$;
    else
        (* increase penalty parameter, tighten tolerances *)
        $\lambda^{k+1} = \lambda^k$;
        $\mu_{k+1} = 100 \mu_k$;
        $\eta_{k+1} = 1/\mu_{k+1}^{0.1}$;
        $\omega_{k+1} = 1/\mu_{k+1}$;
    end (if)
end (for)
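
The following Python sketch follows the structure of Algorithm 17.4 on a small test problem.
Several ingredients are assumptions made here for illustration and are not part of the
algorithm as stated: the test problem itself (minimize $(x_1 - 2)^2 + (x_2 - 1)^2$ subject to
$x_1 + x_2 - 2 = 0$, $0 \le x_1 \le 1.2$, $0 \le x_2 \le 3$, with solution $(1.2, 0.8)^T$ and multiplier $-0.4$), the
inner solver (projected gradient with Armijo backtracking rather than the trust-region method
used by LANCELOT), the flooring of the tolerances at their final values, and all function names.

```python
import math

LOWER, UPPER = [0.0, 0.0], [1.2, 3.0]   # bounds l and u of the test problem

def f(x): return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
def grad_f(x): return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]
def c(x): return [x[0] + x[1] - 2.0]    # vector of equality constraints
def jac_c(x): return [[1.0, 1.0]]       # rows are the constraint gradients

def norm(v): return math.sqrt(sum(t * t for t in v))

def project(g, l, u):
    """Projection (17.52) onto the box [l, u]."""
    return [min(max(gi, li), ui) for gi, li, ui in zip(g, l, u)]

def aug_lag(x, lam, mu):
    """Augmented Lagrangian (17.49)."""
    cx = c(x)
    return (f(x) - sum(li * ci for li, ci in zip(lam, cx))
            + 0.5 * mu * sum(ci * ci for ci in cx))

def grad_aug_lag(x, lam, mu):
    g = grad_f(x)
    for lam_i, c_i, row in zip(lam, c(x), jac_c(x)):
        w = lam_i - mu * c_i
        g = [gj - w * rj for gj, rj in zip(g, row)]
    return g

def kkt_residual(x, lam, mu):
    """Norm of the left-hand side of (17.51)."""
    g = grad_aug_lag(x, lam, mu)
    p = project([xi - gi for xi, gi in zip(x, g)], LOWER, UPPER)
    return norm([xi - pi for xi, pi in zip(x, p)])

def solve_subproblem(x, lam, mu, omega, max_iter=100000):
    """Projected gradient with Armijo backtracking for subproblem (17.50)."""
    for _ in range(max_iter):
        if kkt_residual(x, lam, mu) <= omega:
            break
        g, fx, step = grad_aug_lag(x, lam, mu), aug_lag(x, lam, mu), 1.0
        while step > 1e-16:
            trial = project([xi - step * gi for xi, gi in zip(x, g)], LOWER, UPPER)
            decrease = sum(gi * (xi - ti) for gi, xi, ti in zip(g, x, trial))
            if aug_lag(trial, lam, mu) <= fx - 1e-4 * decrease:
                x = trial
                break
            step *= 0.5
        else:
            break  # no acceptable step found; return the current point
    return x

def bcl(x, lam, eta_star=1e-6, omega_star=1e-6, max_outer=50):
    mu = 10.0
    omega, eta = 1.0 / mu, 1.0 / mu ** 0.1
    for _ in range(max_outer):
        x = solve_subproblem(x, lam, mu, omega)
        viol = norm(c(x))
        if viol <= eta:
            if viol <= eta_star and kkt_residual(x, lam, mu) <= omega_star:
                return x, lam, mu
            # update multipliers, tighten tolerances (mu unchanged)
            lam = [li - mu * ci for li, ci in zip(lam, c(x))]
            eta, omega = eta / mu ** 0.9, omega / mu
        else:
            # increase penalty parameter, reset tolerances
            mu *= 100.0
            eta, omega = 1.0 / mu ** 0.1, 1.0 / mu
        # Assumption of this sketch: never ask for more than the final accuracy.
        eta, omega = max(eta, eta_star), max(omega, omega_star)
    return x, lam, mu

if __name__ == "__main__":
    print(bcl([0.0, 0.0], [0.0]))
```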

The main branch in the algorithm occurs after problem (17.50) has been solved
approximately, when the algorithm tests to see if the constraints have decreased sufficiently,
as measured by the condition

\[ \|c(x_k)\| \le \eta_k. \tag{17.53} \]

If this condition holds, the penalty parameter is not changed for the next iteration because
the current value of $\mu_k$ is producing an acceptable level of constraint violation. The Lagrange
multiplier estimates are updated according to the formula (17.39) and the tolerances $\omega_k$ and
$\eta_k$ are tightened in advance of the next iteration. If, on the other hand, (17.53) does not
hold, then we increase the penalty parameter to ensure that the next subproblem will place
more emphasis on decreasing the constraint violations. The Lagrange multiplier estimates
are not updated in this case; the focus is on improving feasibility.

The constants 0.1, 0.9, and 100 appearing in Algorithm 17.4 are to some extent
arbitrary; other values can be used without compromising theoretical convergence properties.
LANCELOT uses the gradient projection method with trust regions (see (18.61)) to solve the
bound-constrained nonlinear subproblem (17.50). In this context, the gradient projection
method constructs a quadratic model of the augmented Lagrangian $\mathcal{L}_A$ and computes a
step $d$ by approximately solving the trust region problem

\[ \min_d \ \tfrac{1}{2} d^T \left[ \nabla^2_{xx} \mathcal{L}(x_k, \lambda^k) + \mu_k A_k^T A_k \right] d + \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k)^T d \tag{17.54} \]
\[ \text{subject to} \quad l \le x_k + d \le u, \quad \|d\|_\infty \le \Delta, \]

where $A_k = A(x_k)$ and $\Delta$ is a trust region radius. (We can formulate the trust-region
constraint by means of the bounds $-\Delta e \le d \le \Delta e$, where $e = (1, 1, \ldots, 1)^T$.) Each
iteration of the algorithm for solving this subproblem proceeds in two stages. First, a
projected gradient line search is performed to determine which components of $d$ should be
set at one of their bounds. Second, a conjugate gradient iteration minimizes (17.54) with
respect to the free components of $d$ (those not at one of their bounds). Importantly, this
algorithm does not require the factorizations of a KKT matrix or of the constraint Jacobian
$A_k$. The conjugate gradient iteration only requires matrix-vector products, a feature that
makes LANCELOT suitable for large problems.

The Hessian of the Lagrangian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda^k)$ in (17.54) can be replaced by a
quasi-Newton approximation based on the BFGS or SR1 updating formulas. LANCELOT is designed
to take advantage of partially separable structure in the objective function and constraints,
either in the evaluation of the Hessian of the Lagrangian or in the quasi-Newton updates
(see Section 7.4).

LINEARLY  CONSTRAINED  FORMULATION 

The principal idea behind linearly constrained Lagrangian (LCL) methods is to generate
a step by minimizing the Lagrangian (or augmented Lagrangian) subject to linearizations of
the constraints. If we use the formulation (17.48) of the nonlinear programming problem,
the subproblem used in the LCL approach takes the form

\[ \min_x \ F_k(x) \tag{17.55a} \]
\[ \text{subject to} \quad c(x_k) + A_k (x - x_k) = 0, \quad l \le x \le u. \tag{17.55b} \]

There are several possible choices for $F_k(x)$. Early LCL methods defined

\[ F_k(x) = f(x) - \sum_{i=1}^m \lambda_i^k \bar{c}_i^k(x), \tag{17.56} \]

where $\lambda^k$ is the current Lagrange multiplier estimate and $\bar{c}_i^k(x)$ is the difference between
$c_i(x)$ and its linearization at $x_k$, that is,

\[ \bar{c}_i^k(x) = c_i(x) - c_i(x_k) - \nabla c_i(x_k)^T (x - x_k). \tag{17.57} \]

One can show that as $x_k$ converges to a solution $x^*$, the Lagrange multiplier associated with
the equality constraint in (17.55b) converges to the optimal multiplier. Therefore, one can
set $\lambda^k$ in (17.56) to be the Lagrange multiplier for the equality constraint in (17.55b) from
the previous iteration.

Current LCL methods define $F_k$ to be the augmented Lagrangian function

\[ F_k(x) = f(x) - \sum_{i=1}^m \lambda_i^k \bar{c}_i^k(x) + \frac{\mu}{2} \sum_{i=1}^m [\bar{c}_i^k(x)]^2. \tag{17.58} \]

This definition of $F_k$ appears to yield more reliable convergence from remote starting points
than does (17.56), in practice.

There is a notable similarity between (17.58) and the augmented Lagrangian (17.36),
the difference being that the original constraints $c_i(x)$ have been replaced by the functions
$\bar{c}_i^k(x)$, which capture only the “second-order and above” terms of $c_i$. The subproblem (17.55)
differs from the augmented Lagrangian subproblem in that the new $x$ is required to satisfy
exactly a linearization of the equality constraints, while the linear part of each constraint is
factored out of the objective via the use of $\bar{c}_i^k$ in place of $c_i$. A procedure similar to the one
in Algorithm 17.4 can be used for updating the penalty parameter $\mu$ and for adjusting the
tolerances that govern the accuracy of the solution of the subproblem.

Since $\bar{c}_i^k(x)$ has zero gradient at $x = x_k$, we have that $\nabla F_k(x_k) = \nabla f(x_k)$, where $F_k$ is
defined by either (17.56) or (17.58). We can also show that the Hessian of $F_k$ is closely related
to the Hessians of the Lagrangian or augmented Lagrangian functions for (17.1). Because
of these properties, the subproblem (17.55) is similar to the SQP subproblems described in
Chapter 18, with the quadratic objective in SQP being replaced by a nonlinear objective in
LCL.
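
As a concrete check of the zero-gradient property, take the single constraint of problem (17.3),
$c_1(x) = x_1^2 + x_2^2 - 2$, for which $\nabla c_1(x_k) = 2 x_k$. The definition (17.57) then gives

\[ \bar{c}_1^k(x) = (\|x\|_2^2 - 2) - (\|x_k\|_2^2 - 2) - 2 x_k^T (x - x_k) = \|x\|_2^2 - 2 x_k^T x + \|x_k\|_2^2 = \|x - x_k\|_2^2, \]

so that $\nabla \bar{c}_1^k(x) = 2 (x - x_k)$, which vanishes at $x = x_k$. Only the second-order part of $c_1$
survives, and both $\bar{c}_1^k$ and its gradient are zero at the current iterate.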

The well known code MINOS [218] uses the nonlinear model function (17.58) and solves
the subproblem via a reduced gradient method that employs quasi-Newton approximations
to the reduced Hessian of $F_k$. A fairly accurate solution of the subproblem is computed in
MINOS to try to ensure that the Lagrange multiplier estimates for the equality constraint
in (17.55b) (subsequently used in (17.58)) are of good quality. As a result, MINOS typically
requires more evaluations of the objective $f$ and constraint functions $c_i$ (and their gradients)
in total than SQP methods or interior-point methods. The total number of subproblems
(17.55) that are solved in the course of the algorithm is, however, sometimes smaller than
in other approaches.

UNCONSTRAINED  FORMULATION 

We can obtain an unconstrained form of the augmented Lagrangian subproblem
for inequality-constrained problems by using a derivation based on the proximal point
approach. Supposing for simplicity that the problem has no equality constraints ($\mathcal{E} = \emptyset$),
we can write the problem (17.6) equivalently as an unconstrained optimization problem:

\[ \min_{x \in \mathbb{R}^n} F(x), \tag{17.59} \]

where

\[ F(x) = \max_{\lambda \ge 0} \left\{ f(x) - \sum_{i \in \mathcal{I}} \lambda_i c_i(x) \right\} = \begin{cases} f(x) & \text{if } x \text{ is feasible,} \\ \infty & \text{otherwise.} \end{cases} \tag{17.60} \]

To verify these expressions for $F$, consider first the case of $x$ infeasible, that is, $c_i(x) < 0$
for some $i$. We can then choose $\lambda_i$ arbitrarily large and positive while setting $\lambda_j = 0$ for all
$j \neq i$, to verify that $F(x)$ is infinite in this case. If $x$ is feasible, we have $c_i(x) \ge 0$ for all
$i \in \mathcal{I}$, so the maximum is attained at $\lambda = 0$, and $F(x) = f(x)$ in this case. By combining
(17.59) with (17.60), we have

\[ \min_{x \in \mathbb{R}^n} F(x) = \min_{x \ \text{feasible}} f(x), \tag{17.61} \]

which is simply the original inequality-constrained problem. It is not practical to minimize
$F$ directly, however, since this function is not smooth: it jumps from a finite value to an
infinite value as $x$ crosses the boundary of the feasible set.

We can make this approach more practical by replacing $F$ by a smooth approximation
$\hat{F}(x; \lambda^k, \mu_k)$ which depends on the penalty parameter $\mu_k$ and Lagrange multiplier estimate
$\lambda^k$. This approximation is defined as follows:

\[ \hat{F}(x; \lambda^k, \mu_k) = \max_{\lambda \ge 0} \left\{ f(x) - \sum_{i \in \mathcal{I}} \lambda_i c_i(x) - \frac{1}{2 \mu_k} \sum_{i \in \mathcal{I}} (\lambda_i - \lambda_i^k)^2 \right\}. \tag{17.62} \]

The final term in this expression applies a penalty for any move of $\lambda$ away from the previous
estimate $\lambda^k$; it encourages the new maximizer $\lambda$ to stay proximal to the previous estimate
$\lambda^k$. Since (17.62) represents a bound-constrained quadratic problem in $\lambda$, separable in the
individual components $\lambda_i$, we can perform the maximization explicitly, to obtain

\[ \lambda_i = \begin{cases} 0 & \text{if } -c_i(x) + \lambda_i^k / \mu_k \le 0, \\ \lambda_i^k - \mu_k c_i(x) & \text{otherwise.} \end{cases} \tag{17.63} \]

By substituting these values in (17.62), we find that

\[ \hat{F}(x; \lambda^k, \mu_k) = f(x) + \sum_{i \in \mathcal{I}} \psi(c_i(x), \lambda_i^k; \mu_k), \tag{17.64} \]

where the function $\psi$ of three scalar arguments is defined as follows:

\[ \psi(t, \sigma; \mu) = \begin{cases} -\sigma t + \dfrac{\mu}{2} t^2 & \text{if } t - \sigma/\mu \le 0, \\ -\dfrac{1}{2\mu} \sigma^2 & \text{otherwise.} \end{cases} \tag{17.65} \]

Hence, we can obtain the new iterate $x_k$ by minimizing $\hat{F}(x; \lambda^k, \mu_k)$ with respect to $x$,
and use the formula (17.63) to obtain the updated Lagrange multiplier estimates $\lambda^{k+1}$. By
comparing with Framework 17.3, we see that $\hat{F}$ plays the role of $\mathcal{L}_A$ and that the scheme
just described extends the augmented Lagrangian methods for equality constraints neatly
to the inequality-constrained case. Unlike the bound-constrained and linearly constrained
formulations, however, this unconstrained formulation is not the basis of any widely used
software packages, so its practical properties have not been tested.
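
The scalar function in (17.65) and the multiplier formula (17.63) translate directly into code.
The sketch below is a minimal rendering for a single inequality constraint; the function names
are hypothetical. The two branches of `psi` agree at the switching point $t = \sigma/\mu$, which is
what makes the approximation continuous there.

```python
def psi(t, sigma, mu):
    """Scalar penalty term (17.65) for one inequality constraint value t = c_i(x)."""
    if t - sigma / mu <= 0.0:
        return -sigma * t + 0.5 * mu * t * t
    return -sigma * sigma / (2.0 * mu)

def update_multiplier(c_i, lam_i, mu):
    """Explicit maximizer (17.63) for one multiplier component."""
    if -c_i + lam_i / mu <= 0.0:
        return 0.0
    return lam_i - mu * c_i

if __name__ == "__main__":
    # With sigma = 1 and mu = 2 the switching point is t = 0.5.
    print(psi(0.5, 1.0, 2.0), psi(1.0, 1.0, 2.0))
    print(update_multiplier(1.0, 1.0, 2.0), update_multiplier(-0.5, 1.0, 2.0))
```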

17.5 PERSPECTIVES AND SOFTWARE


The quadratic penalty approach is often used by practitioners when the number of
constraints is small. In fact, minimization of $Q(x; \mu)$ is sometimes performed for just one large
value of $\mu$. Unless $\mu$ is chosen wisely (with the benefit of experience with the underlying
application), the resulting solution may not be very accurate. Since the main software
packages for constrained optimization do not implement a quadratic penalty approach, little
attention has been paid to techniques for updating the penalty parameter, adjusting the
tolerances $\tau_k$, and choosing the starting points $x_k^s$ for each iteration. (See Gould [141] for a
discussion of these issues.)

Despite the intuitive appeal and simplicity of the quadratic penalty method of
Framework 17.1, the augmented Lagrangian method of Sections 17.3 and 17.4 is generally
preferred. The subproblems are in general no more difficult to solve, and the introduction
of multiplier estimates reduces the likelihood that large values of $\mu$ will be needed to
obtain good feasibility and accuracy, thereby avoiding ill conditioning of the subproblem.
The quadratic penalty approach remains, however, an important mechanism for regularizing
other algorithms such as sequential quadratic programming (SQP) methods, as we mention
at the end of Section 17.1.

A general-purpose $\ell_1$ penalty method was developed by Fletcher in the 1980s. It
is known as the S$\ell_1$QP method because it has features in common with SQP methods.
More recently, an $\ell_1$ penalty method that uses linear programming subproblems has been
implemented as part of the KNITRO [46] software package. These two methods are discussed
in Section 18.5.

The $\ell_1$ penalty function has received significant attention in recent years. It has
been successfully used to treat difficult problems, such as mathematical programs with
complementarity constraints (MPCCs), in which the constraints do not satisfy standard
constraint qualifications [274]. By including these problematic constraints as a penalty
term, rather than linearizing them exactly, and treating the remaining constraints using other
techniques such as SQP or interior-point, it is possible to extend the range of applicability
of these other approaches. See [8] for an active-set method and [16, 191] for interior-point
methods for MPCCs. The SNOPT software package uses an $\ell_1$ penalty approach within an
SQP method as a safeguard strategy in case the quadratic model appears to be infeasible or
unbounded or to have unbounded multipliers.

Augmented Lagrangian methods have been popular for many years, in part because
of their simplicity. The MINOS and LANCELOT packages rank among the best
implementations of augmented Lagrangian methods. Both are suitable for large-scale nonlinear
programming problems. At a general level, the linearly constrained Lagrangian (LCL) method
of MINOS and the bound-constrained Lagrangian (BCL) method of LANCELOT have
important features in common. They differ significantly, however, in the formulation of the
step-computation subproblems and in the techniques used to solve these subproblems.
MINOS follows a reduced-space approach to handle linearized constraints and employs a
(dense) quasi-Newton approximation to the Hessian of the Lagrangian. As a result, MINOS
is most successful for problems with relatively few degrees of freedom. LANCELOT, on the
other hand, is more effective when there are relatively few constraints. As indicated in
Section 17.4, LANCELOT does not require a factorization of the constraint Jacobian matrix $A$,
again enhancing its suitability for very large problems, and provides a variety of Hessian
approximation options and preconditioners. The PENNON software package [184] is based on an
augmented Lagrangian approach and has the advantage of permitting semi-definite matrix
constraints.

A weakness of both the bound-constrained and unconstrained Lagrangian methods
is that they complicate constraints by squaring them in (17.49); progress in feasibility is
only achieved through the minimization of the augmented Lagrangian. In contrast, the LCL
formulation (17.55) promotes steady progress toward feasibility by performing a
Newton-like step on the constraints. Not surprisingly, numerical experience has shown an advantage
of MINOS over LANCELOT for problems with linear constraints.

Smooth exact penalty functions have been constructed from the augmented
Lagrangian functions of Section 17.3, but these are considerably more complicated. As an
example, we mention the function of Fletcher for equality-constrained problems, defined as
follows:

\[ \phi_F(x; \mu) = f(x) - \lambda(x)^T c(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i(x)^2. \tag{17.66} \]

The Lagrange multiplier estimates $\lambda(x)$ are defined explicitly in terms of $x$ via the
least-squares estimate, defined as

\[ \lambda(x) = [A(x) A(x)^T]^{-1} A(x) \nabla f(x). \tag{17.67} \]

The function $\phi_F$ is differentiable and exact, though the threshold value $\mu^*$ defining the
exactness property is not as easy to specify as for the nonsmooth $\ell_1$ penalty function.
Drawbacks of the penalty function $\phi_F$ include the cost of evaluating $\lambda(x)$ via (17.67), the fact
that $\lambda(x)$ is not uniquely defined when $A(x)$ does not have full rank, and the observation
that estimates of $\lambda$ may be poor when $A(x)$ is nearly singular.

NOTES  AND  REFERENCES 

The quadratic penalty function was first proposed by Courant [81]. Gould [140]
addresses the issue of stable determination of the Newton step for $Q(x; \mu_k)$. His formula
(2.2) differs from our formula (17.20) in the right-hand side, but both systems give rise to
the same $p$ component.

The augmented Lagrangian method was proposed by Hestenes [167] and Powell [240].
In the early days it was known as the “method of multipliers.” A key reference in this
area is Bertsekas [18]. Chapters 1-3 of that book contain a thorough motivation of the
method that outlines its connections to other approaches. Other introductory discussions
are given by Fletcher [101, Section 12.2], and Polak [236, Section 2.8]. The extension to
inequality constraints in the unconstrained formulation was described by Rockafellar [269]
and Powell [243].

Linearly constrained Lagrangian methods were proposed by Robinson [266] and
Rosen and Kreuser [271]. The MINOS implementation is due to Murtagh and Saunders [218]
and the LANCELOT implementation is due to Conn, Gould and Toint [72]. We have followed
Friedlander and Saunders [114] in our use of the terms “linearly constrained Lagrangian”
and “bound-constrained Lagrangian.”


EXERCISES

17.1

(a)  Write  an  equality-constrained  problem  which  has  a  local  solution  and  for  which  the 
quadratic  penalty  function  Q  is  unbounded  for  any  value  of  the  penalty  parameter. 

(b)  Write  a  problem  with  a  single  inequality  constraint  that  has  the  same  unboundedness 
property. 

17.2 Draw the contour lines of the quadratic penalty function $Q$ for problem (17.5)
corresponding to $\mu = 1$. Find the stationary points of $Q$.

17.3 Minimize the quadratic penalty function for problem (17.3) for $\mu_k = 1, 10, 100, 1000$
using an unconstrained minimization algorithm. Set $\tau_k = 1/\mu_k$ in Framework 17.1,
and choose the starting point $x_{k+1}^s$ for each minimization to be the solution
for the previous value of the penalty parameter. Report the approximate solution of each
penalty function.

17.4 For $z \in \mathbb{R}$, show that the function $\min(0, z)^2$ has a discontinuous second
derivative at $z = 0$. (It follows that the quadratic penalty function (17.7) may not have continuous
second derivatives even when $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, in (17.6) are all twice continuously
differentiable.)

17.5 Write a quadratic program similar to (17.31) for the case when the norm in
(17.32) is the infinity norm.

17.6 Suppose that a nonlinear program has a minimizer $x^*$ with Lagrange multiplier
vector $\lambda^*$. One can show (Fletcher [101, Theorem 14.3.2]) that the function $\phi_1(x; \mu)$ does
not have a local minimizer at $x^*$ unless $\mu > \|\lambda^*\|_\infty$. Verify that this observation holds for
Example 17.1.

17.7 Verify (17.28).

17.8 Prove the second part of Theorem 17.4. That is, if $\hat{x}$ is a stationary point of
$\phi_1(x; \mu)$ for all $\mu$ sufficiently large, but $\hat{x}$ is infeasible for problem (17.6), then $\hat{x}$ is
an infeasible stationary point. (Hint: Use the fact that $D(\phi_1(x; \mu); p) = \nabla f(x)^T p + \mu D(h(x); p)$,
where $h$ is defined in (17.27).)

17.9 Verify that the KKT conditions for the bound-constrained problem

$$\min_{x \in \mathbb{R}^n} \phi(x) \quad \text{subject to } l \le x \le u$$

are equivalent to the compactly stated condition

$$x - P(x - \nabla\phi(x), l, u) = 0,$$

where the projection operator $P$ onto the rectangular box $[l, u]$ is defined in (17.52).

17.10 Calculate the gradient and Hessian of the LCL objective functions $F_k(x)$ defined
by (17.56) and (17.58). Evaluate these quantities at $x = x^*$.

17.11 Show that the function $\psi(t, \sigma; \mu)$ defined in (17.65) has a discontinuity in
its second derivative with respect to $t$ when $t = \sigma/\mu$. Assuming that $c_i : \mathbb{R}^n \to \mathbb{R}$
is twice continuously differentiable, write down the second partial derivative matrix of
$\psi(c_i(x), \lambda_i; \mu)$ with respect to $x$ for the two cases $c_i(x) < \lambda_i/\mu$ and $c_i(x) > \lambda_i/\mu$.

17.12 Verify that the multipliers $\lambda_i$, $i \in \mathcal{I}$, defined in (17.63) are indeed those that
attain the maximum in (17.62), and that the equality (17.64) holds. Hint: Use the fact that
the KKT conditions for the problem

$$\max \phi(x) \quad \text{subject to } x \ge 0$$

indicate that at a stationary point, we either have $x_i = 0$ and $[\nabla\phi(x)]_i \le 0$, or $x_i > 0$ and
$[\nabla\phi(x)]_i = 0$.

CHAPTER 18. SEQUENTIAL QUADRATIC PROGRAMMING

One of the most effective methods for nonlinearly constrained optimization generates steps
by  solving  quadratic  subproblems.  This  sequential  quadratic  programming  (SQP)  approach 
can  be  used  both  in  line  search  and  trust-region  frameworks,  and  is  appropriate  for  small 
or  large  problems.  Unlike  linearly  constrained  Lagrangian  methods  (Chapter  17),  which  are 
effective  when  most  of  the  constraints  are  linear,  SQP  methods  show  their  strength  when 
solving  problems  with  significant  nonlinearities  in  the  constraints. 

All the methods considered in this chapter are active-set methods; a more descriptive
title for this chapter would perhaps be “Active-Set Methods for Nonlinear Programming.”
In Chapter 19 we study interior-point methods for nonlinear programming, a competing
approach for handling inequality-constrained problems.

There are two types of active-set SQP methods. In the IQP approach, a general
inequality-constrained quadratic program is solved at each iteration, with the twin goals
of computing a step and generating an estimate of the optimal active set. EQP methods
decouple  these  computations.  They  first  compute  an  estimate  of  the  optimal  active  set,  then 
solve  an  equality-constrained  quadratic  program  to  find  the  step.  In  this  chapter  we  study 
both  IQP  and  EQP  methods. 

Our  development  of  SQP  methods  proceeds  in  two  stages.  First,  we  consider  local 
methods  that  motivate  the  SQP  approach  and  allow  us  to  introduce  the  step  computation 
techniques  in  a  simple  setting.  Second,  we  consider  practical  line  search  and  trust-region 
methods  that  achieve  convergence  from  remote  starting  points.  Throughout  the  chapter  we 
give  consideration  to  the  algorithmic  demands  of  solving  large  problems. 


18.1 LOCAL SQP METHOD


We begin by considering the equality-constrained problem

$$\min f(x) \tag{18.1a}$$

$$\text{subject to } c(x) = 0, \tag{18.1b}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ are smooth functions. The idea behind the
SQP approach is to model (18.1) at the current iterate $x_k$ by a quadratic programming
subproblem, then use the minimizer of this subproblem to define a new iterate $x_{k+1}$. The
challenge is to design the quadratic subproblem so that it yields a good step for the nonlinear
optimization problem. Perhaps the simplest derivation of SQP methods, which we present
now, views them as an application of Newton’s method to the KKT optimality conditions
for (18.1).

From (12.33), we know that the Lagrangian function for this problem is
$\mathcal{L}(x, \lambda) = f(x) - \lambda^T c(x)$. We use $A(x)$ to denote the Jacobian matrix of the constraints, that is,

$$A(x)^T = [\nabla c_1(x), \nabla c_2(x), \ldots, \nabla c_m(x)], \tag{18.2}$$

where $c_i(x)$ is the $i$th component of the vector $c(x)$. The first-order (KKT) conditions
(12.34) of the equality-constrained problem (18.1) can be written as a system of $n + m$
equations in the $n + m$ unknowns $x$ and $\lambda$:

$$F(x, \lambda) = \begin{bmatrix} \nabla f(x) - A(x)^T \lambda \\ c(x) \end{bmatrix} = 0. \tag{18.3}$$

Any solution $(x^*, \lambda^*)$ of the equality-constrained problem (18.1) for which $A(x^*)$ has full
rank satisfies (18.3). One approach that suggests itself is to solve the nonlinear equations
(18.3) by using Newton’s method, as described in Chapter 11.

The Jacobian of (18.3) with respect to $x$ and $\lambda$ is given by

$$F'(x, \lambda) = \begin{bmatrix} \nabla^2_{xx} \mathcal{L}(x, \lambda) & -A(x)^T \\ A(x) & 0 \end{bmatrix}. \tag{18.4}$$

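The Newton step from the iterate $(x_k, \lambda_k)$ is thus given by

$$\begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \end{bmatrix} = \begin{bmatrix} x_k \\ \lambda_k \end{bmatrix} + \begin{bmatrix} p_k \\ p_\lambda \end{bmatrix}, \tag{18.5}$$

where $p_k$ and $p_\lambda$ solve the Newton-KKT system

$$\begin{bmatrix} \nabla^2_{xx} \mathcal{L}_k & -A_k^T \\ A_k & 0 \end{bmatrix} \begin{bmatrix} p_k \\ p_\lambda \end{bmatrix} = \begin{bmatrix} -\nabla f_k + A_k^T \lambda_k \\ -c_k \end{bmatrix}. \tag{18.6}$$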

This Newton iteration is well defined when the KKT matrix in (18.6) is nonsingular. We
saw in Chapter 16 that this matrix is nonsingular if the following assumption holds at
$(x, \lambda) = (x_k, \lambda_k)$.

Assumptions 18.1.

(a) The constraint Jacobian $A(x)$ has full row rank;

(b) The matrix $\nabla^2_{xx} \mathcal{L}(x, \lambda)$ is positive definite on the tangent space of the constraints, that is,
$d^T \nabla^2_{xx} \mathcal{L}(x, \lambda) d > 0$ for all $d \ne 0$ such that $A(x)d = 0$.

The first assumption is the linear independence constraint qualification discussed in
Chapter 12 (see Definition 12.4), which we assume throughout this chapter. The second
condition holds whenever $(x, \lambda)$ is close to the optimum $(x^*, \lambda^*)$ and the second-order
sufficient condition is satisfied at the solution (see Theorem 12.6). The Newton iteration (18.5),
(18.6) can be shown to be quadratically convergent under these assumptions (see
Theorem 18.4) and constitutes an excellent algorithm for solving equality-constrained problems,
provided that the starting point is close enough to $x^*$.

SQP  FRAMEWORK 

There is an alternative way to view the iteration (18.5), (18.6). Suppose that at the
iterate $(x_k, \lambda_k)$ we model problem (18.1) using the quadratic program

$$\min_p \; f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx} \mathcal{L}_k p \tag{18.7a}$$

$$\text{subject to } A_k p + c_k = 0. \tag{18.7b}$$



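If Assumptions 18.1 hold, this problem has a unique solution $(p_k, l_k)$ that satisfies

$$\nabla^2_{xx} \mathcal{L}_k p_k + \nabla f_k - A_k^T l_k = 0, \tag{18.8a}$$

$$A_k p_k + c_k = 0. \tag{18.8b}$$

The vectors $p_k$ and $l_k$ can be identified with the solution of the Newton equations (18.6). If
we subtract $A_k^T \lambda_k$ from both sides of the first equation in (18.6), we obtain

$$\begin{bmatrix} \nabla^2_{xx} \mathcal{L}_k & -A_k^T \\ A_k & 0 \end{bmatrix} \begin{bmatrix} p_k \\ \lambda_{k+1} \end{bmatrix} = \begin{bmatrix} -\nabla f_k \\ -c_k \end{bmatrix}. \tag{18.9}$$

Hence, by nonsingularity of the coefficient matrix, we have that $\lambda_{k+1} = l_k$ and that $p_k$ solves
(18.7) and (18.6).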

The new iterate $(x_{k+1}, \lambda_{k+1})$ can therefore be defined either as the solution of the
quadratic program (18.7) or as the iterate generated by Newton’s method (18.5), (18.6)
applied  to  the  optimality  conditions  of  the  problem.  Both  viewpoints  are  useful.  The  Newton 
point  of  view  facilitates  the  analysis,  whereas  the  SQP  framework  enables  us  to  derive 
practical  algorithms  and  to  extend  the  technique  to  the  inequality-constrained  case. 

We  now  state  the  SQP  method  in  its  simplest  form. 

Algorithm 18.1 (Local SQP Algorithm for solving (18.1)).

Choose an initial pair $(x_0, \lambda_0)$; set $k \leftarrow 0$;
repeat until a convergence test is satisfied
    Evaluate $f_k$, $\nabla f_k$, $\nabla^2_{xx} \mathcal{L}_k$, $c_k$, and $A_k$;
    Solve (18.7) to obtain $p_k$ and $l_k$;
    Set $x_{k+1} \leftarrow x_k + p_k$ and $\lambda_{k+1} \leftarrow l_k$;
end (repeat)
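
The iteration of Algorithm 18.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `local_sqp`, the callback signatures, the stopping test on the KKT residual, and the small test problem are our own choices and do not come from the text. Each pass solves the KKT system (18.9) directly with a dense solver, which is reasonable only for small problems.

```python
import numpy as np

def local_sqp(grad_f, c, jac_c, hess_lag, x0, lam0, tol=1e-10, max_iter=50):
    """Local SQP iteration (Algorithm 18.1) for: min f(x) subject to c(x) = 0."""
    x = np.asarray(x0, dtype=float)
    lam = np.asarray(lam0, dtype=float)
    n, m = x.size, lam.size
    for _ in range(max_iter):
        g, A, ck = grad_f(x), jac_c(x), c(x)
        # Convergence test (our choice): norm of the KKT residual F(x, lam) in (18.3).
        if np.linalg.norm(np.concatenate([g - A.T @ lam, ck])) <= tol:
            break
        H = hess_lag(x, lam)
        # KKT system (18.9): the unknowns are the step p_k and the new multiplier.
        K = np.block([[H, -A.T], [A, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-g, -ck]))
        x, lam = x + sol[:n], sol[n:]
    return x, lam

# Illustrative problem (an assumption): min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0.
grad_f = lambda x: np.array([1.0, 1.0])
c = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 2.0])
jac_c = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])
hess_lag = lambda x, lam: -2.0 * lam[0] * np.eye(2)  # Hessian of f - lam * c

x_star, lam_star = local_sqp(grad_f, c, jac_c, hess_lag, x0=[-1.2, -0.9], lam0=[-0.4])
print(x_star, lam_star)
```

For this test problem the KKT conditions are satisfied at $x^* = (-1, -1)$ with $\lambda^* = -1/2$, and the starting pair is chosen close to that solution, as the local theory requires. No safeguard is included for a singular KKT matrix or for remote starting points.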

We note in passing that, in the objective (18.7a) of the quadratic program, we could
replace the linear term $\nabla f_k^T p$ by $\nabla_x \mathcal{L}(x_k, \lambda_k)^T p$, since the constraint (18.7b) makes the
two choices equivalent. In this case, (18.7a) is a quadratic approximation of the Lagrangian
function. This fact provides a motivation for our choice of the quadratic model (18.7): We
first replace the nonlinear program (18.1) by the problem of minimizing the Lagrangian
subject to the equality constraints (18.1b), then make a quadratic approximation to the
Lagrangian and a linear approximation to the constraints to obtain (18.7).

INEQUALITY  CONSTRAINTS 

The SQP framework can be extended easily to the general nonlinear programming
problem

$$\min f(x) \tag{18.10a}$$

$$\text{subject to } c_i(x) = 0, \quad i \in \mathcal{E}, \tag{18.10b}$$

$$c_i(x) \ge 0, \quad i \in \mathcal{I}. \tag{18.10c}$$

To model this problem we now linearize both the inequality and equality constraints to
obtain

$$\min_p \; f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx} \mathcal{L}_k p \tag{18.11a}$$

$$\text{subject to } \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i \in \mathcal{E}, \tag{18.11b}$$

$$\nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i \in \mathcal{I}. \tag{18.11c}$$

We can use one of the algorithms for quadratic programming described in Chapter 16 to
solve this problem. The new iterate is given by $(x_k + p_k, \lambda_{k+1})$ where $p_k$ and $\lambda_{k+1}$ are the
solution and the corresponding Lagrange multiplier of (18.11). A local SQP method for
(18.10) is thus given by Algorithm 18.1 with the modification that the step is computed
from (18.11).

In this IQP approach the set of active constraints $\mathcal{A}_k$ at the solution of (18.11)
constitutes our guess of the active set at the solution of the nonlinear program. If the
SQP method is able to correctly identify this optimal active set (and not change its guess
at a subsequent iteration) then it will act like a Newton method for equality-constrained
optimization and will converge rapidly. The following result gives conditions under which
this desirable behavior takes place. Recall that strict complementarity is said to hold at a
solution pair $(x^*, \lambda^*)$ if there is no index $i \in \mathcal{I}$ such that $\lambda_i^* = c_i(x^*) = 0$.

Theorem 18.1 (Robinson [267]).

Suppose that $x^*$ is a local solution of (18.10) at which the KKT conditions are satisfied
for some $\lambda^*$. Suppose, too, that the linear independence constraint qualification (LICQ)
(Definition 12.4), the strict complementarity condition (Definition 12.5), and the second-order
sufficient conditions (Theorem 12.6) hold at $(x^*, \lambda^*)$. Then if $(x_k, \lambda_k)$ is sufficiently close to
$(x^*, \lambda^*)$, there is a local solution of the subproblem (18.11) whose active set $\mathcal{A}_k$ is the same as
the active set $\mathcal{A}(x^*)$ of the nonlinear program (18.10) at $x^*$.

It  is  also  remarkable  that,  far  from  the  solution,  the  SQP  approach  is  usually  able  to  improve 
the  estimate  of  the  active  set  and  guide  the  iterates  toward  a  solution;  see  Section  18.7. 


18.2 PREVIEW OF PRACTICAL SQP METHODS

IQP  AND  EQP 


There are two ways of designing SQP methods for solving the general nonlinear
programming problem (18.10). The first is the approach just described, which solves at
every iteration the quadratic subprogram (18.11), taking the active set at the solution of
this subproblem as a guess of the optimal active set. This approach is referred to as the
IQP  (inequality-constrained  QP)  approach;  it  has  proved  to  be  quite  successful  in  practice. 
Its  main  drawback  is  the  expense  of  solving  the  general  quadratic  program  (18.11),  which 
can  be  high  when  the  problem  is  large.  As  the  iterates  of  the  SQP  method  converge  to 
the  solution,  however,  solving  the  quadratic  subproblem  becomes  economical  if  we  use 
information  from  the  previous  iteration  to  make  a  good  guess  of  the  optimal  solution  of  the 
current  subproblem.  This  warm-start  strategy  is  described  below. 

The  second  approach  selects  a  subset  of  constraints  at  each  iteration  to  be  the  so-called 
working  set,  and  solves  only  equality-constrained  subproblems  of  the  form  (18.7),  where 
the  constraints  in  the  working  sets  are  imposed  as  equalities  and  all  other  constraints  are 
ignored.  The  working  set  is  updated  at  every  iteration  by  rules  based  on  Lagrange  multiplier 
estimates,  or  by  solving  an  auxiliary  subproblem.  This  EQP  (equality-constrained  QP) 
approach  has  the  advantage  that  the  equality-constrained  quadratic  subproblems  are  less 
expensive  to  solve  than  (18.11)  in  the  large-scale  case. 

An example of an EQP method is the sequential linear-quadratic programming
(SLQP) method discussed in Section 18.5. This approach constructs a linear program by
omitting the quadratic term $p^T \nabla^2_{xx} \mathcal{L}_k p$ from (18.11a) and adding a trust-region constraint
$\|p\|_\infty \le \Delta_k$ to the subproblem. The active set of the resulting linear programming
subproblem is taken to be the working set for the current iteration. The method then fixes the
constraints in the working set and solves an equality-constrained quadratic program (with
the term $p^T \nabla^2_{xx} \mathcal{L}_k p$ reinserted) to obtain the SQP step. Another successful EQP method
is the gradient projection method described in Section 16.7 in the context of bound-constrained
quadratic programs. In this method, the working set is determined by minimizing
a quadratic model along the path obtained by projecting the steepest descent direction onto
the feasible region.

ENFORCING  CONVERGENCE 

To  be  practical,  an  SQP  method  must  be  able  to  converge  from  remote  starting  points 
and  on  nonconvex  problems.  We  now  outline  how  the  local  SQP  strategy  can  be  adapted  to 
meet  these  goals. 

We  begin  by  drawing  an  analogy  with  unconstrained  optimization.  In  its  simplest 
form,  the  Newton  iteration  for  minimizing  a  function  f  takes  a  step  to  the  minimizer  of  the 
quadratic  model 


$$m_k(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2 f_k p.$$

This framework is useful near the solution, where the Hessian $\nabla^2 f(x_k)$ is normally positive
definite and the quadratic model has a well defined minimizer. When $x_k$ is not close to the
solution, however, the model function $m_k$ may not be convex. Trust-region methods ensure
that the new iterate is always well defined and useful by restricting the candidate step $p_k$
to some neighborhood of the origin. Line search methods modify the Hessian in $m_k(p)$ to
make it positive definite (possibly replacing it by a quasi-Newton approximation $B_k$), to
ensure that $p_k$ is a descent direction for the objective function $f$.

Similar strategies are used to globalize SQP methods. If $\nabla^2_{xx} \mathcal{L}_k$ is positive definite on
the tangent space of the active constraints, the quadratic subproblem (18.7) has a unique
solution. When $\nabla^2_{xx} \mathcal{L}_k$ does not have this property, line search methods either replace it
by a positive definite approximation $B_k$ or modify $\nabla^2_{xx} \mathcal{L}_k$ directly during the process of
matrix factorization. In all these cases, the subproblem (18.7) becomes well defined, but the
modifications may introduce unwanted distortions in the model.

Trust-region  SQP  methods  add  a  constraint  to  the  subproblem,  limiting  the  step  to 
a  region  within  which  the  model  (18.7)  is  considered  reliable.  These  methods  are  able  to 
handle indefinite Hessians $\nabla^2_{xx} \mathcal{L}_k$. The inclusion of the trust region may, however, cause the
subproblem  to  become  infeasible,  and  the  procedures  for  handling  this  situation  complicate 
the  algorithms  and  increase  their  computational  cost.  Due  to  these  tradeoffs,  neither  of  the 
two  SQP  approaches — line  search  or  trust-region — is  currently  regarded  as  clearly  superior 
to  the  other. 

The  technique  used  to  accept  or  reject  steps  also  impacts  the  efficiency  of  SQP  methods. 
In unconstrained optimization, the merit function is simply the objective $f$, and it remains
fixed  throughout  the  minimization  procedure.  For  constrained  problems,  we  use  devices 
such  as  a  merit  function  or  a  filter  (see  Section  15.4).  The  parameters  or  entries  used  in 
these  devices  must  be  updated  in  a  way  that  is  compatible  with  the  step  produced  by  the 
SQP  method. 


18.3  ALGORITHMIC  DEVELOPMENT 


In  this  section  we  expand  on  the  ideas  of  the  previous  section  and  describe  various  ingredients 
needed  to  produce  practical  SQP  algorithms.  We  focus  on  techniques  for  ensuring  that  the 
subproblems  are  always  feasible,  on  alternative  choices  for  the  Hessian  of  the  quadratic 
model,  and  on  step-acceptance  mechanisms. 


HANDLING  INCONSISTENT  LINEARIZATIONS 

A possible difficulty with SQP methods is that the linearizations (18.11b), (18.11c) of
the nonlinear constraints may give rise to an infeasible subproblem. Consider, for example,
the case in which $n = 1$ and the constraints are $x \le 1$ and $x^2 \ge 4$. When we linearize these
constraints at $x_k = 1$, we obtain the inequalities

$$-p \ge 0 \quad \text{and} \quad 2p - 3 \ge 0,$$

which are inconsistent.
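
The inconsistency in this one-dimensional example can be checked numerically. The sketch below is illustrative only: the helper `feasible_step_interval` is a hypothetical name of our own, and it applies only to the scalar case $n = 1$, where each linearized constraint $c_i(x_k) + c_i'(x_k) p \ge 0$ is a half-line in $p$ and the feasible set is an intersection of half-lines.

```python
import math

def feasible_step_interval(constraints, xk):
    """Interval of steps p with c(xk) + c'(xk) * p >= 0 for every constraint.

    `constraints` is a list of (c, dc) pairs of scalar functions.
    Returns (lo, hi), or None when the linearized constraints are inconsistent.
    """
    lo, hi = -math.inf, math.inf
    for c, dc in constraints:
        value, slope = c(xk), dc(xk)
        if slope > 0:
            lo = max(lo, -value / slope)   # p >= -value / slope
        elif slope < 0:
            hi = min(hi, -value / slope)   # p <= -value / slope
        elif value < 0:
            return None                    # constant constraint is violated
    return (lo, hi) if lo <= hi else None

# The constraints x <= 1 and x^2 >= 4, written in the form c(x) >= 0.
constraints = [
    (lambda x: 1.0 - x, lambda x: -1.0),
    (lambda x: x * x - 4.0, lambda x: 2.0 * x),
]

interval_at_1 = feasible_step_interval(constraints, 1.0)
interval_at_minus_3 = feasible_step_interval(constraints, -3.0)
print(interval_at_1, interval_at_minus_3)
```

At $x_k = 1$ the two half-lines $p \le 0$ and $p \ge 3/2$ do not intersect, so the function returns `None`. At the feasible point $x_k = -3$, which we picked only for contrast, the linearizations are consistent.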
To overcome this difficulty, we can reformulate the nonlinear program (18.10) as the
$\ell_1$ penalty problem

$$\min_{x, v, w, t} \; f(x) + \mu \sum_{i \in \mathcal{E}} (v_i + w_i) + \mu \sum_{i \in \mathcal{I}} t_i \tag{18.12a}$$

$$\text{subject to } c_i(x) = v_i - w_i, \quad i \in \mathcal{E}, \tag{18.12b}$$

$$c_i(x) \ge -t_i, \quad i \in \mathcal{I}, \tag{18.12c}$$

$$v, w, t \ge 0, \tag{18.12d}$$

for some positive choice of the penalty parameter $\mu$. The quadratic subproblem (18.11)
associated with (18.12) is always feasible. As discussed in Chapter 17, if the nonlinear
problem (18.10) has a solution $x^*$ that satisfies certain regularity assumptions, and if the
penalty parameter $\mu$ is sufficiently large, then $x^*$ (along with $v_i^* = w_i^* = 0$, $i \in \mathcal{E}$ and
$t_i^* = 0$, $i \in \mathcal{I}$) is a solution of the penalty problem (18.12). If, on the other hand, there is no
feasible solution to the nonlinear problem and $\mu$ is large enough, then the penalty problem
(18.12) usually determines a stationary point of the infeasibility measure. The choice of $\mu$
has been discussed in Chapter 17 and is considered again in Section 18.5. The SNOPT software
package [127] uses the formulation (18.12), which is sometimes called the elastic mode, to
deal with inconsistencies of the linearized constraints.

Other  procedures  for  relaxing  the  constraints  are  presented  in  Section  18.5  in  the 
context  of  trust-region  methods. 

FULL  QUASI-NEWTON  APPROXIMATIONS 

The Hessian of the Lagrangian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda_k)$ is made up of second derivatives of
the objective function and constraints. In some applications, this information is not easy to
compute, so it is useful to consider replacing the Hessian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda_k)$ in (18.11a) by a
quasi-Newton approximation. Since the BFGS and SR1 formulae have proved to be successful in
the context of unconstrained optimization, we can employ them here as well.

The update for $B_k$ that results from the step from iterate $k$ to iterate $k + 1$ makes use
of the vectors $s_k$ and $y_k$ defined as follows:

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}). \tag{18.13}$$

We compute the new approximation $B_{k+1}$ using the BFGS or SR1 formulae given,
respectively, by (6.19) and (6.24). We can view this process as the application of quasi-Newton
updating to the case in which the objective function is given by the Lagrangian $\mathcal{L}(x, \lambda)$ (with
$\lambda$ fixed). This viewpoint immediately reveals the strengths and weaknesses of this approach.

If $\nabla^2_{xx} \mathcal{L}$ is positive definite in the region where the minimization takes place, then
BFGS quasi-Newton approximations $B_k$ will reflect some of the curvature information of the
problem, and the iteration will converge robustly and rapidly, just as in the unconstrained
BFGS method. If, however, $\nabla^2_{xx} \mathcal{L}$ contains negative eigenvalues, then the BFGS approach
of approximating it with a positive definite matrix may be problematic. BFGS updating
requires that $s_k$ and $y_k$ satisfy the curvature condition $s_k^T y_k > 0$, which may not hold when
$s_k$ and $y_k$ are defined by (18.13), even when the iterates are close to the solution.

To overcome this difficulty, we could skip the BFGS update if the condition

$$s_k^T y_k \ge \theta s_k^T B_k s_k \tag{18.14}$$

is not satisfied, where $\theta$ is a positive parameter ($10^{-2}$, say). This strategy may, on occasion,
yield poor performance or even failure, so it cannot be regarded as adequate for
general-purpose algorithms.

A  more  effective  modification  ensures  that  the  update  is  always  well  defined  by 
modifying  the  definition  of  yk. 

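Procedure 18.2 (Damped BFGS Updating).

Given: symmetric and positive definite matrix $B_k$;

Define $s_k$ and $y_k$ as in (18.13) and set

$$r_k = \theta_k y_k + (1 - \theta_k) B_k s_k,$$

where the scalar $\theta_k$ is defined as

$$\theta_k = \begin{cases} 1 & \text{if } s_k^T y_k \ge 0.2\, s_k^T B_k s_k, \\ (0.8\, s_k^T B_k s_k) / (s_k^T B_k s_k - s_k^T y_k) & \text{if } s_k^T y_k < 0.2\, s_k^T B_k s_k; \end{cases} \tag{18.15}$$

Update $B_k$ as follows:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{r_k r_k^T}{s_k^T r_k}. \tag{18.16}$$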


The formula (18.16) is simply the standard BFGS update formula, with $y_k$ replaced
by $r_k$. It guarantees that $B_{k+1}$ is positive definite, since it is easy to show that when $\theta_k \ne 1$
we have

$$s_k^T r_k = 0.2\, s_k^T B_k s_k > 0. \tag{18.17}$$

To gain more insight into this strategy, note that the choice $\theta_k = 0$ gives $B_{k+1} = B_k$, while
$\theta_k = 1$ gives the (possibly indefinite) matrix produced by the unmodified BFGS update. A
value $\theta_k \in (0, 1)$ thus produces a matrix that interpolates the current approximation $B_k$
and the one produced by the unmodified BFGS formula. The choice of $\theta_k$ ensures that the
new approximation stays close enough to the current approximation $B_k$ to ensure positive
definiteness.
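
Procedure 18.2 translates almost line for line into code. The following sketch assumes dense NumPy arrays and a symmetric positive definite $B_k$; the function name `damped_bfgs_update` and the small numerical example are our own and serve only to illustrate the damping when the curvature condition fails.

```python
import numpy as np

def damped_bfgs_update(B, s, y):
    """Damped BFGS update (Procedure 18.2); B must be symmetric positive definite."""
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    # Damping parameter theta_k from (18.15).
    if sy >= 0.2 * sBs:
        theta = 1.0
    else:
        theta = 0.8 * sBs / (sBs - sy)
    # r_k interpolates between y_k (theta = 1) and B_k s_k (theta = 0).
    r = theta * y + (1.0 - theta) * Bs
    # Standard BFGS formula (18.16) with y_k replaced by r_k.
    return B - np.outer(Bs, Bs) / sBs + np.outer(r, r) / (s @ r)

# Example with negative curvature along s (s^T y < 0), so damping is active.
B0 = np.eye(2)
s0 = np.array([1.0, 0.0])
y0 = np.array([-1.0, 0.5])
B1 = damped_bfgs_update(B0, s0, y0)
print(B1, np.linalg.eigvalsh(B1))
```

In this example $s_k^T y_k = -1$, so $\theta_k = 0.4$ and $s_k^T r_k = 0.2\, s_k^T B_k s_k$, in agreement with (18.17); the updated matrix remains positive definite even though an unmodified BFGS update would not be well defined here.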
Damped BFGS updating often works well but it, too, can behave poorly on difficult
problems. It still fails to address the underlying problem that the Lagrangian Hessian may
not be positive definite. For this reason, SR1 updating may be more appropriate, and is
indeed a good choice for trust-region SQP methods. An SR1 approximation to the Hessian
of the Lagrangian is obtained by applying formula (6.24) with $s_k$ and $y_k$ defined by (18.13),
using the safeguards described in Chapter 6. Line search methods cannot, however, accept
indefinite Hessian approximations and would therefore need to modify the SR1 formula,
possibly by adding a sufficiently large multiple of the identity matrix; see the discussion
around (19.25).

All  quasi-Newton  approximations  Bk  discussed  above  are  dense  n  x  n  matrices  that 
can  be  expensive  to  store  and  manipulate  in  the  large-scale  case.  Limited-memory  updating 
is  useful  in  this  context  and  is  often  implemented  in  software  packages.  (See  (19.29)  for  an 
implementation  of  limited-memory  BFGS  in  a  constrained  optimization  algorithm.) 

REDUCED-HESSIAN  QUASI-NEWTON  APPROXIMATIONS 

When we examine the KKT system (18.9) for the equality-constrained problem (18.1),
we see that the part of the step $p_k$ in the range space of $A_k^T$ is completely determined by
the second block row $A_k p_k = -c_k$. The Lagrangian Hessian $\nabla^2_{xx} \mathcal{L}_k$ affects only the part
of $p_k$ in the orthogonal subspace, namely, the null space of $A_k$. It is reasonable, therefore,
to consider quasi-Newton methods that find approximations to only that part of $\nabla^2_{xx} \mathcal{L}_k$
that affects the component of $p_k$ in the null space of $A_k$. In this section, we consider
quasi-Newton methods based on these reduced-Hessian approximations. Our focus is on
equality-constrained problems in this section, as existing SQP methods for the full problem
(18.10) use reduced-Hessian approaches only after an equality-constrained subproblem has
been generated.

To derive reduced-Hessian methods, we consider solution of the step equations (18.9)
by means of the null space approach of Section 16.2. In that section, we defined matrices $Y_k$
and $Z_k$ whose columns span the range space of $A_k^T$ and the null space of $A_k$, respectively. By
writing

$$p_k = Y_k p_Y + Z_k p_Z, \tag{18.18}$$

and substituting into (18.9), we obtain the following system to be solved for $p_Y$ and $p_Z$:

$$(A_k Y_k) p_Y = -c_k, \tag{18.19a}$$

$$(Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k) p_Z = -Z_k^T \nabla^2_{xx} \mathcal{L}_k Y_k p_Y - Z_k^T \nabla f_k. \tag{18.19b}$$

From the first block of equations in (18.9) we see that the Lagrange multipliers $\lambda_{k+1}$, which
are sometimes called QP multipliers, can be obtained by solving

$$(A_k Y_k)^T \lambda_{k+1} = Y_k^T (\nabla f_k + \nabla^2_{xx} \mathcal{L}_k p_k). \tag{18.20}$$
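
As an illustration of (18.18)-(18.20), the sketch below computes the step and the QP multipliers by the null-space approach and compares them with a direct solve of the KKT system (18.9). Taking $Y_k$ and $Z_k$ from a full QR factorization of $A_k^T$ is one convenient choice and is an assumption of this sketch, as are the function name `null_space_step` and the small test data.

```python
import numpy as np

def null_space_step(H, A, g, c):
    """Solve the KKT system (18.9) via (18.18)-(18.20).

    H: Hessian of the Lagrangian, A: constraint Jacobian (m x n, full row rank),
    g: gradient of f, c: constraint values.
    """
    m, n = A.shape
    # Orthonormal bases from a full QR factorization of A^T (our choice of Y, Z).
    Q, _ = np.linalg.qr(A.T, mode="complete")
    Y, Z = Q[:, :m], Q[:, m:]
    p_Y = np.linalg.solve(A @ Y, -c)                              # (18.19a)
    rhs = -Z.T @ (H @ (Y @ p_Y)) - Z.T @ g
    p_Z = np.linalg.solve(Z.T @ H @ Z, rhs)                       # (18.19b)
    p = Y @ p_Y + Z @ p_Z                                         # (18.18)
    lam = np.linalg.solve((A @ Y).T, Y.T @ (g + H @ p))           # (18.20)
    return p, lam

# Small illustrative data (assumed, not from the text).
H = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 2.0]])
A = np.array([[1.0, 1.0, 1.0]])
g = np.array([1.0, -2.0, 0.5])
c = np.array([0.3])

p_ns, lam_ns = null_space_step(H, A, g, c)

# Reference: direct solve of the full KKT system (18.9).
K = np.block([[H, -A.T], [A, np.zeros((1, 1))]])
ref = np.linalg.solve(K, np.concatenate([-g, -c]))
print(p_ns, lam_ns, ref)
```

The two computations agree to rounding error whenever Assumptions 18.1 hold. The null-space form requires only the reduced matrix $Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$ to be nonsingular.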
We can avoid computation of the Hessian $\nabla^2_{xx} \mathcal{L}_k$ by introducing several
approximations in the null-space approach. First, we delete the term involving $p_k$ from the
right-hand side of (18.20), thereby decoupling the computations of $p_k$ and $\lambda_{k+1}$ and
eliminating the need for $\nabla^2_{xx} \mathcal{L}_k$ in this term. This simplification can be justified by observing that
$p_k$ converges to zero as we approach the solution, whereas $\nabla f_k$ normally does not.
Therefore, the multipliers computed in this manner will be good estimates of the QP multipliers
near the solution. More specifically, if we choose $Y_k = A_k^T$ (which is a valid choice for $Y_k$
when $A_k$ has full row rank; see (15.16)), we obtain

$$\hat{\lambda}_{k+1} = (A_k A_k^T)^{-1} A_k \nabla f_k. \tag{18.21}$$

These  are  called  the  least-squares  multipliers  because  they  can  also  be  derived  by  solving  the 
problem 


min  \\VxC{Xk,ml  =  |V/t-  AIX\\22.  (18.22) 

This  observation  shows  that  the  least-squares  multipliers  are  useful  even  when  the  current 
iterate  is  far  from  the  solution,  because  they  seek  to  satisfy  the  first-order  optimality  condition 
in  (18.3)  as  closely  as  possible.  Conceptually,  the  use  of  least-squares  multipliers  transforms 
the SQP method from a primal-dual iteration in $x$ and $\lambda$ to a purely primal iteration in the
$x$ variable alone.
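As a rough illustration of formula (18.21), the following plain-Python sketch solves $(A_k A_k^T)\lambda = A_k \nabla f_k$ for a small dense Jacobian stored as a list of rows. The function name `least_squares_multipliers` and the test data are assumptions made for this example only. Forming the normal equations explicitly is adequate for a toy problem; for ill-conditioned $A_k$ a QR factorization of $A_k^T$ is the numerically safer route.

```python
def least_squares_multipliers(A, grad_f):
    """Solve (A A^T) lam = A grad_f, with A given as a list of m rows (full row rank)."""
    m = len(A)
    # Normal-equations matrix A A^T and right-hand side A grad_f.
    G = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(m)] for i in range(m)]
    rhs = [sum(a * g for a, g in zip(A[i], grad_f)) for i in range(m)]
    # Gaussian elimination with partial pivoting on the augmented matrix [G | rhs].
    M = [G[i] + [rhs[i]] for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution.
    lam = [0.0] * m
    for i in reversed(range(m)):
        s = M[i][m] - sum(M[i][k] * lam[k] for k in range(i + 1, m))
        lam[i] = s / M[i][i]
    return lam


# One constraint c(x) = x1^2 + x2^2 - 2 at x = (-1, -1), objective f(x) = x1 + x2.
lam_circle = least_squares_multipliers([[-2.0, -2.0]], [1.0, 1.0])
# Two constraints in three variables; the residual grad_f - A^T lam is (0, 0, 5).
lam_two = least_squares_multipliers([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [3.0, 4.0, 5.0])
```

In the second call the third gradient component cannot be matched by any multiplier, which is exactly the situation in which (18.22) gives the closest fit to the first-order condition.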

Our second simplification of the null-space approach is to remove the cross term
$Z_k^T \nabla^2_{xx}\mathcal{L}_k Y_k p_Y$ in (18.19b), thereby yielding the simpler system

$$ (Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k)\, p_Z = -Z_k^T \nabla f_k. \tag{18.23} $$

This approach has the advantage that it needs to approximate only the matrix $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$,
not the $(n - m) \times m$ cross-term matrix $Z_k^T \nabla^2_{xx}\mathcal{L}_k Y_k$, which is a relatively large matrix
when $m \gg n - m$. Dropping the cross term is justified when $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$ is replaced by a
quasi-Newton approximation because the normal component $p_Y$ usually converges to zero
faster than the tangential component $p_Z$, thereby making (18.23) a good approximation of
(18.19b).

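Having dispensed with the partial Hessian $Z_k^T \nabla^2_{xx}\mathcal{L}_k Y_k$, we discuss how to approximate
the remaining part $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$. Suppose we have just taken a step
$\alpha_k p_k = x_{k+1} - x_k = \alpha_k Z_k p_Z + \alpha_k Y_k p_Y$. By Taylor's theorem, writing
$\nabla^2_{xx}\mathcal{L}_{k+1} = \nabla^2_{xx}\mathcal{L}(x_{k+1}, \lambda_{k+1})$, we have

$$ \nabla^2_{xx}\mathcal{L}_{k+1}\, \alpha_k p_k \approx \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}). $$

By premultiplying by $Z_k^T$, we have

$$ Z_k^T \nabla^2_{xx}\mathcal{L}_{k+1} Z_k\, \alpha_k p_Z \approx -Z_k^T \nabla^2_{xx}\mathcal{L}_{k+1} Y_k\, \alpha_k p_Y + Z_k^T \left[ \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}) \right]. \tag{18.24} $$

If we drop the cross term $Z_k^T \nabla^2_{xx}\mathcal{L}_{k+1} Y_k\, \alpha_k p_Y$ (using the rationale discussed earlier), we see
that the secant equation for $M_k$ can be defined by

$$ M_{k+1} s_k = y_k, \tag{18.25} $$

where $s_k$ and $y_k$ are given by

$$ s_k = \alpha_k p_Z, \qquad y_k = Z_k^T \left[ \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}) \right]. \tag{18.26} $$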

We then apply the BFGS or SR1 formulae, using these definitions for the correction vectors
$s_k$ and $y_k$, to define the new approximation $M_{k+1}$. An advantage of this reduced-Hessian
approach,  compared  to  full-Hessian  quasi-Newton  approximations,  is  that  the  reduced 
Hessian  is  much  more  likely  to  be  positive  definite,  even  when  the  current  iterate  is  some 
distance  from  the  solution.  When  using  the  BFGS  formula,  the  safeguarding  mechanism 
discussed  above  will  be  required  less  often  in  line  search  implementations. 
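The update of the reduced-Hessian approximation can be sketched as follows, assuming the pair $(s_k, y_k)$ from (18.26) has already been formed and that the current approximation is symmetric. The function name `bfgs_update` is illustrative, and skipping the update when $s_k^T y_k$ is not safely positive is a simple stand-in chosen for this sketch, not the damping safeguard referred to in the text.

```python
def bfgs_update(M, s, y, eps=1e-12):
    """Return the BFGS update of a symmetric matrix M so that M_new @ s == y.

    The update is skipped (M is returned unchanged) when s^T y <= eps, because
    positive definiteness could not be preserved in that case.
    """
    n = len(s)
    sy = sum(si * yi for si, yi in zip(s, y))
    if sy <= eps:
        return [row[:] for row in M]
    Ms = [sum(M[i][j] * s[j] for j in range(n)) for i in range(n)]
    sMs = sum(s[i] * Ms[i] for i in range(n))
    # M - (M s)(M s)^T / (s^T M s) + y y^T / (y^T s)
    return [[M[i][j] - Ms[i] * Ms[j] / sMs + y[i] * y[j] / sy for j in range(n)]
            for i in range(n)]


identity = [[1.0, 0.0], [0.0, 1.0]]
s_k, y_k = [1.0, 1.0], [3.0, 1.0]
M_next = bfgs_update(identity, s_k, y_k)
# Check the secant equation (18.25): M_next s_k should reproduce y_k.
secant = [sum(M_next[i][j] * s_k[j] for j in range(2)) for i in range(2)]
# A pair with negative curvature leaves the matrix untouched.
M_skipped = bfgs_update(identity, [1.0, 0.0], [-1.0, 0.0])
```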

MERIT  FUNCTIONS 

SQP  methods  often  use  a  merit  function  to  decide  whether  a  trial  step  should  be 
accepted.  In  line  search  methods,  the  merit  function  controls  the  size  of  the  step;  in  trust- 
region  methods  it  determines  whether  the  step  is  accepted  or  rejected  and  whether  the 
trust-region  radius  should  be  adjusted.  A  variety  of  merit  functions  have  been  used  in 
SQP  methods,  including  nonsmooth  penalty  functions  and  augmented  Lagrangians.  We 
limit our discussion to exact, nonsmooth merit functions typified by the $\ell_1$ merit function
discussed in Chapters 15 and 17.

For the purpose of step computation and evaluation of a merit function, inequality
constraints $c(x) \ge 0$ are often converted to the form

$$ \bar{c}(x, s) = c(x) - s = 0, $$

where $s \ge 0$ is a vector of slacks. (The condition $s \ge 0$ is typically not monitored by the
merit function.) Therefore, in the discussion that follows we assume that all constraints are
in the form of equalities, and we focus our attention on problem (18.1).

The $\ell_1$ merit function for (18.1) takes the form

$$ \phi_1(x; \mu) = f(x) + \mu \|c(x)\|_1. \tag{18.27} $$

In a line search method, a step $\alpha_k p_k$ will be accepted if the following sufficient decrease
condition holds:

$$ \phi_1(x_k + \alpha_k p_k; \mu_k) \le \phi_1(x_k; \mu_k) + \eta \alpha_k D(\phi_1(x_k; \mu_k); p_k), \qquad \eta \in (0, 1), \tag{18.28} $$

where $D(\phi_1(x_k; \mu_k); p_k)$ denotes the directional derivative of $\phi_1$ in the direction $p_k$. This
requirement is analogous to the Armijo condition (3.4) for unconstrained optimization
provided that $p_k$ is a descent direction, that is, $D(\phi_1(x_k; \mu_k); p_k) < 0$. This descent condition
holds if the penalty parameter $\mu$ is chosen sufficiently large, as we show in the following result.

Theorem  18.2. 

Let $p_k$ and $\lambda_{k+1}$ be generated by the SQP iteration (18.9). Then the directional derivative
of $\phi_1$ in the direction $p_k$ satisfies

$$ D(\phi_1(x_k; \mu); p_k) = \nabla f_k^T p_k - \mu \|c_k\|_1. \tag{18.29} $$

Moreover, we have that

$$ D(\phi_1(x_k; \mu); p_k) \le -p_k^T \nabla^2_{xx}\mathcal{L}_k p_k - (\mu - \|\lambda_{k+1}\|_\infty) \|c_k\|_1. \tag{18.30} $$

PROOF. By applying Taylor's theorem (see (2.5)) to $f$ and $c_i$, $i = 1, 2, \ldots, m$, we obtain

$$ \begin{aligned} \phi_1(x_k + \alpha p; \mu) - \phi_1(x_k; \mu) &= f(x_k + \alpha p) - f_k + \mu \|c(x_k + \alpha p)\|_1 - \mu \|c_k\|_1 \\ &\le \alpha \nabla f_k^T p + \gamma \alpha^2 \|p\|^2 + \mu \|c_k + \alpha A_k p\|_1 - \mu \|c_k\|_1, \end{aligned} $$

where the positive constant $\gamma$ bounds the second-derivative terms in $f$ and $c$. If $p = p_k$ is
given by (18.9), we have that $A_k p_k = -c_k$, so for $\alpha \le 1$ we have that

$$ \phi_1(x_k + \alpha p_k; \mu) - \phi_1(x_k; \mu) \le \alpha \left[ \nabla f_k^T p_k - \mu \|c_k\|_1 \right] + \alpha^2 \gamma \|p_k\|^2. $$

By arguing similarly, we also obtain the following lower bound:

$$ \phi_1(x_k + \alpha p_k; \mu) - \phi_1(x_k; \mu) \ge \alpha \left[ \nabla f_k^T p_k - \mu \|c_k\|_1 \right] - \alpha^2 \gamma \|p_k\|^2. $$

Taking limits, we conclude that the directional derivative of $\phi_1$ in the direction $p_k$ is given by

$$ D(\phi_1(x_k; \mu); p_k) = \nabla f_k^T p_k - \mu \|c_k\|_1, \tag{18.31} $$

which proves (18.29). The fact that $p_k$ satisfies the first equation in (18.9) implies that

$$ D(\phi_1(x_k; \mu); p_k) = -p_k^T \nabla^2_{xx}\mathcal{L}_k p_k + p_k^T A_k^T \lambda_{k+1} - \mu \|c_k\|_1. $$

From the second equation in (18.9), we can replace the term $p_k^T A_k^T \lambda_{k+1}$ in this expression by
$-c_k^T \lambda_{k+1}$. By making this substitution in the expression above and invoking the inequality

$$ -c_k^T \lambda_{k+1} \le \|c_k\|_1 \|\lambda_{k+1}\|_\infty, $$

we obtain (18.30).

□

It follows from (18.30) that $p_k$ will be a descent direction for $\phi_1$ if $p_k \ne 0$, $\nabla^2_{xx}\mathcal{L}_k$ is
positive definite and

$$ \mu > \|\lambda_{k+1}\|_\infty. \tag{18.32} $$

(A more detailed analysis shows that this assumption on $\nabla^2_{xx}\mathcal{L}_k$ can be relaxed; we need
only the reduced Hessian $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$ to be positive definite.)
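The two statements of Theorem 18.2 can be checked numerically on a small example. The sketch below assumes a toy problem that is not part of the text (objective $x_1 + x_2$, single constraint $x_1^2 + x_2^2 - 2 = 0$), uses the identity in place of $\nabla^2_{xx}\mathcal{L}_k$, and solves (18.9) in closed form for one constraint. All function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy problem (an assumption for illustration only).
def f(x): return x[0] + x[1]
def grad_f(x): return [1.0, 1.0]
def c(x): return x[0] ** 2 + x[1] ** 2 - 2.0
def grad_c(x): return [2.0 * x[0], 2.0 * x[1]]

def l1_merit(x, mu):
    return f(x) + mu * abs(c(x))

def sqp_step(x, h=1.0):
    """Solve (18.9) with Hessian h*I and a single constraint; returns (p, lam_next)."""
    g, a = grad_f(x), grad_c(x)
    lam = (dot(a, g) - h * c(x)) / dot(a, a)          # from a^T p = -c
    p = [(lam * ai - gi) / h for ai, gi in zip(a, g)]  # from h p - lam a = -g
    return p, lam

x, mu = [-1.5, -0.5], 1.0
p, lam = sqp_step(x)

# Directional derivative from (18.29) and the upper bound from (18.30).
D = dot(grad_f(x), p) - mu * abs(c(x))
bound = -dot(p, p) - (mu - abs(lam)) * abs(c(x))

# One-sided finite-difference estimate of the same directional derivative.
eps = 1e-6
x_eps = [xi + eps * pi for xi, pi in zip(x, p)]
fd = (l1_merit(x_eps, mu) - l1_merit(x, mu)) / eps
```

Here `mu` exceeds `abs(lam)`, so condition (18.32) holds and `D` is negative; with a single constraint the bound (18.30) happens to be attained with equality up to rounding.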

One strategy for choosing the new value of the penalty parameter $\mu$ in $\phi_1(x; \mu)$ at
every iteration is to increase the previous value, if necessary, so as to satisfy (18.32), with
some margin. It has been observed, however, that this strategy may select inappropriate
values of $\mu$ and often interferes with the progress of the iteration.

An alternative approach, based on (18.29), is to require that the directional derivative
be sufficiently negative in the sense that

$$ D(\phi_1(x_k; \mu); p_k) = \nabla f_k^T p_k - \mu \|c_k\|_1 \le -\rho \mu \|c_k\|_1, $$

for some $\rho \in (0, 1)$. This inequality holds if

$$ \mu \ge \frac{\nabla f_k^T p_k}{(1 - \rho) \|c_k\|_1}. \tag{18.33} $$


This  choice  is  not  dependent  on  the  Lagrange  multipliers  and  performs  adequately  in 
practice. 

A more effective strategy for choosing $\mu$, which is appropriate both in the line search
and trust-region contexts, considers the effect of the step on a model of the merit function.
We define a (piecewise) quadratic model of $\phi_1$ by

$$ q_\mu(p) = f_k + \nabla f_k^T p + \frac{\sigma}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu\, m(p), \tag{18.34} $$

where

$$ m(p) = \|c_k + A_k p\|_1, $$

and $\sigma$ is a parameter to be defined below. After computing a step $p_k$, we choose the penalty
parameter $\mu$ large enough that

$$ q_\mu(0) - q_\mu(p_k) \ge \rho \mu \left[ m(0) - m(p_k) \right], \tag{18.35} $$

for some parameter $\rho \in (0, 1)$. It follows from (18.34) and (18.7b) that inequality (18.35)
is satisfied for

$$ \mu \ge \frac{\nabla f_k^T p_k + (\sigma/2)\, p_k^T \nabla^2_{xx}\mathcal{L}_k p_k}{(1 - \rho) \|c_k\|_1}. \tag{18.36} $$


If the value of $\mu$ from the previous iteration of the SQP method satisfies (18.36), it is left
unchanged. Otherwise, $\mu$ is increased so that it satisfies this inequality with some margin.

The constant $\sigma$ is used to handle the case in which the Hessian is not positive
definite. We define $\sigma$ as

$$ \sigma = \begin{cases} 1 & \text{if } p_k^T \nabla^2_{xx}\mathcal{L}_k p_k > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{18.37} $$

It is easy to verify that, if $\mu$ satisfies (18.36), this choice of $\sigma$ ensures that
$D(\phi_1(x_k; \mu); p_k) \le -\rho \mu \|c_k\|_1$, so that $p_k$ is a descent direction for the merit function $\phi_1$.
This conclusion is not always valid if $\sigma = 1$ and $p_k^T \nabla^2_{xx}\mathcal{L}_k p_k < 0$. By comparing (18.33) and (18.36) we see that,
when $\sigma > 0$, the strategy based on (18.35) selects a larger penalty parameter, thus placing
more weight on the reduction of the constraints. This property is advantageous if the step
$p_k$ decreases the constraints but increases the objective, for in this case the step has a better
chance of being accepted by the merit function.
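The rule (18.36) with $\sigma$ from (18.37) translates into a few lines of code. In this sketch the multiplicative `margin`, the default values, and the handling of the case $\|c_k\|_1 = 0$ (where the bound is undefined and the previous value is kept) are assumptions; the text only asks for an increase "with some margin".

```python
def penalty_lower_bound(g_dot_p, p_hess_p, c_norm1, rho=0.5):
    """Right-hand side of (18.36); sigma follows (18.37)."""
    sigma = 1.0 if p_hess_p > 0.0 else 0.0
    return (g_dot_p + 0.5 * sigma * p_hess_p) / ((1.0 - rho) * c_norm1)


def update_penalty(mu_prev, g_dot_p, p_hess_p, c_norm1, rho=0.5, margin=1.1):
    """Keep mu_prev if it satisfies (18.36); otherwise raise it past the bound."""
    if c_norm1 == 0.0:
        return mu_prev  # linearized constraints already satisfied: bound undefined
    bound = penalty_lower_bound(g_dot_p, p_hess_p, c_norm1, rho)
    if mu_prev >= bound:
        return mu_prev
    return margin * bound


# Positive curvature: grad_f^T p = -0.2, p^T H p = 0.425, ||c||_1 = 0.5.
mu_kept = update_penalty(1.0, -0.2, 0.425, 0.5)
mu_raised = update_penalty(0.01, -0.2, 0.425, 0.5)
# Negative curvature: sigma = 0, so the curvature term is ignored.
mu_neg = update_penalty(1.0, 0.3, -1.0, 0.5)
# Directional derivative (18.29) for the negative-curvature case.
D_neg = 0.3 - mu_neg * 0.5
```

In the negative-curvature case the resulting directional derivative still satisfies $D \le -\rho \mu \|c_k\|_1$, which is the descent property claimed for this choice of $\sigma$.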


SECOND-ORDER  CORRECTION 

In  Chapter  15,  we  showed  by  means  of  Example  15.4  that  many  merit  functions  can 
impede  progress  of  an  optimization  algorithm,  a  phenomenon  known  as  the  Maratos  effect. 
We  now  show  that  the  step  analyzed  in  that  example  is,  in  fact,  produced  by  an  SQP  method. 


□ Example 18.1 (Example 15.4, Revisited)

Consider problem (15.34). At the iterate $x_k = (\cos\theta, \sin\theta)^T$, let us compute a search
direction $p_k$ by solving the SQP subproblem (18.7) with $\nabla^2_{xx}\mathcal{L}_k$ replaced by
$\nabla^2_{xx}\mathcal{L}(x^*, \lambda^*) = I$. Since

$$ f_k = -\cos\theta, \qquad \nabla f_k = \begin{bmatrix} 4\cos\theta - 1 \\ 4\sin\theta \end{bmatrix}, \qquad A_k^T = \begin{bmatrix} 2\cos\theta \\ 2\sin\theta \end{bmatrix}, $$

the quadratic subproblem (18.7) takes the form

$$ \min_p\; (4\cos\theta - 1)\, p_1 + 4\sin\theta\, p_2 + \tfrac{1}{2} p_1^2 + \tfrac{1}{2} p_2^2 \qquad \text{subject to} \quad p_2 + \cot\theta\, p_1 = 0. $$

By solving this subproblem, we obtain the direction

$$ p_k = \begin{bmatrix} \sin^2\theta \\ -\sin\theta \cos\theta \end{bmatrix}, \tag{18.38} $$

which coincides with (15.35).

□

We mentioned in Section 15.4 that the difficulties associated with the Maratos effect
can be overcome by means of a second-order correction. There are various ways of applying
this technique; we describe one possible implementation next.

Suppose that the SQP method has computed a step $p_k$ from (18.11). If this step yields
an increase in the merit function $\phi_1$, a possible cause is that our linear approximations to
the constraints are not sufficiently accurate. To overcome this deficiency, we could re-solve
(18.11) with the linear terms $c_i(x_k) + \nabla c_i(x_k)^T p$ replaced by quadratic approximations,

$$ c_i(x_k) + \nabla c_i(x_k)^T p + \tfrac{1}{2} p^T \nabla^2 c_i(x_k)\, p. \tag{18.39} $$

However, even if the Hessians of the constraints are individually available, the resulting
quadratically constrained subproblem may be too difficult to solve. Instead, we evaluate the
constraint values at the new point $x_k + p_k$ and make use of the following approximations.
By Taylor's theorem, we have

$$ c_i(x_k + p_k) \approx c_i(x_k) + \nabla c_i(x_k)^T p_k + \tfrac{1}{2} p_k^T \nabla^2 c_i(x_k)\, p_k. \tag{18.40} $$

Assuming that the (still unknown) second-order step $p$ will not be too different from $p_k$,
we can approximate the last term in (18.39) as follows:

$$ p^T \nabla^2 c_i(x_k)\, p \approx p_k^T \nabla^2 c_i(x_k)\, p_k. \tag{18.41} $$

By making this substitution in (18.39) and using (18.40), we obtain the second-order
correction subproblem

$$ \begin{aligned} \min_p \quad & \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p \\ \text{subject to} \quad & \nabla c_i(x_k)^T p + d_i = 0, \quad i \in \mathcal{E}, \\ & \nabla c_i(x_k)^T p + d_i \ge 0, \quad i \in \mathcal{I}, \end{aligned} $$

where

$$ d_i = c_i(x_k + p_k) - \nabla c_i(x_k)^T p_k, \qquad i \in \mathcal{E} \cup \mathcal{I}. $$

The second-order correction step requires evaluation of the constraints $c_i(x_k + p_k)$
for $i \in \mathcal{E} \cup \mathcal{I}$, and therefore it is preferable not to apply it every time the merit function
increases. One strategy is to use it only if the increase in the merit function is accompanied
by an increase in the constraint norm.
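The effect of the correction can be seen on the problem of Example 18.1. The sketch below assumes that problem (15.34), which is not restated in this section, is $\min\, 2(x_1^2 + x_2^2 - 1) - x_1$ subject to $x_1^2 + x_2^2 - 1 = 0$; this is consistent with the gradient and Jacobian shown in the example. The Hessian is taken to be the identity, as in the example, and the one-constraint QP is solved in closed form. The values $\theta = 0.5$ and $\mu = 2$ and all function names are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed form of problem (15.34).
def f(x): return 2.0 * (x[0] ** 2 + x[1] ** 2 - 1.0) - x[0]
def c(x): return x[0] ** 2 + x[1] ** 2 - 1.0
def grad_f(x): return [4.0 * x[0] - 1.0, 4.0 * x[1]]
def grad_c(x): return [2.0 * x[0], 2.0 * x[1]]

def l1_merit(x, mu):
    return f(x) + mu * abs(c(x))

def qp_step(g, a, d):
    """Minimize g^T p + 0.5 p^T p subject to a^T p + d = 0 (one constraint)."""
    lam = (dot(a, g) - d) / dot(a, a)
    return [lam * ai - gi for ai, gi in zip(a, g)]

theta, mu = 0.5, 2.0
x = [math.cos(theta), math.sin(theta)]
g, a = grad_f(x), grad_c(x)

# Plain SQP step: reproduces (18.38) and increases both f and |c|.
p = qp_step(g, a, c(x))
x_full = [xi + pi for xi, pi in zip(x, p)]

# Second-order correction: same QP with c(x_k) replaced by d = c(x_k + p_k) - a^T p_k.
d = c(x_full) - dot(a, p)
p_soc = qp_step(g, a, d)
x_soc = [xi + pi for xi, pi in zip(x, p_soc)]
```

For this particular problem the constraint violation drops from $\sin^2\theta$ after the plain step to $\tfrac{1}{4}\sin^4\theta$ after the corrected step, and the corrected point reduces the $\ell_1$ merit function whereas the plain step increases it.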

It can be shown that when the step $p_k$ is generated by the SQP method (18.11) then,
near a solution satisfying second-order sufficient conditions, the algorithm above takes
either the full step $p_k$ or the corrected step $p_k + \hat{p}_k$, where $\hat{p}_k$ denotes the
second-order correction. The merit function does not interfere
with the iteration, so superlinear convergence is attained, as in the local algorithm.




18.4 A PRACTICAL LINE SEARCH SQP METHOD


From  the  discussion  in  the  previous  section,  we  can  see  that  there  is  a  wide  variety  of  line 
search  SQP  methods  that  differ  in  the  way  the  Hessian  approximation  is  computed,  in  the 
step  acceptance  mechanism,  and  in  other  algorithmic  features.  We  now  incorporate  some  of 
these  ideas  into  a  concrete,  practical  SQP  algorithm  for  solving  the  nonlinear  programming 
problem  (18.10).  To  keep  the  description  simple,  we  will  not  include  a  mechanism  such 
as  (18.12)  to  ensure  the  feasibility  of  the  subproblem,  or  a  second-order  correction  step. 
Rather,  the  search  direction  is  obtained  simply  by  solving  the  subproblem  (18.11).  We  also 
assume  that  the  quadratic  program  (18.11)  is  convex,  so  that  we  can  solve  it  by  means  of  the 
active-set  method  for  quadratic  programming  (Algorithm  16.3)  described  in  Chapter  16. 

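Algorithm 18.3 (Line Search SQP Algorithm).

  Choose parameters $\eta \in (0, 0.5)$, $\tau \in (0, 1)$, and an initial pair $(x_0, \lambda_0)$;
  Evaluate $f_0$, $\nabla f_0$, $c_0$, $A_0$;
  If a quasi-Newton approximation is used, choose an initial $n \times n$ symmetric
  positive definite Hessian approximation $B_0$, otherwise compute $\nabla^2_{xx}\mathcal{L}_0$;
  repeat until a convergence test is satisfied
    Compute $p_k$ by solving (18.11); let $\hat{\lambda}$ be the corresponding multiplier;
    Set $p_\lambda \leftarrow \hat{\lambda} - \lambda_k$;
    Choose $\mu_k$ to satisfy (18.36) with $\sigma = 1$;
    Set $\alpha_k \leftarrow 1$;
    while $\phi_1(x_k + \alpha_k p_k; \mu_k) > \phi_1(x_k; \mu_k) + \eta \alpha_k D(\phi_1(x_k; \mu_k); p_k)$
      Reset $\alpha_k \leftarrow \tau_\alpha \alpha_k$ for some $\tau_\alpha \in (0, \tau]$;
    end (while)
    Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$ and $\lambda_{k+1} \leftarrow \lambda_k + \alpha_k p_\lambda$;
    Evaluate $f_{k+1}$, $\nabla f_{k+1}$, $c_{k+1}$, $A_{k+1}$ (and possibly $\nabla^2_{xx}\mathcal{L}_{k+1}$);
    If a quasi-Newton approximation is used, set
      $s_k \leftarrow \alpha_k p_k$ and $y_k \leftarrow \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1})$,
    and obtain $B_{k+1}$ by updating $B_k$ using a quasi-Newton formula;
  end (repeat)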
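The loop structure of Algorithm 18.3 can be sketched on a toy equality-constrained problem. Everything specific here is an assumption made for illustration: the problem ($\min\, x_1 + x_2$ subject to $x_1^2 + x_2^2 - 2 = 0$), the closed-form solution of the one-constraint QP, the use of the exact Hessian $-2\lambda I$ of the Lagrangian $f - \lambda c$ with a small floor to keep the QP convex, the starting point, and the parameter values. Inequality constraints, quasi-Newton updates, and safeguards for infeasible subproblems are not covered.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy problem (an assumption): min x1 + x2  subject to  x1^2 + x2^2 - 2 = 0.
def f(x): return x[0] + x[1]
def grad_f(x): return [1.0, 1.0]
def c(x): return x[0] ** 2 + x[1] ** 2 - 2.0
def grad_c(x): return [2.0 * x[0], 2.0 * x[1]]

def l1_merit(x, mu):
    return f(x) + mu * abs(c(x))

def line_search_sqp(x, lam, eta=1e-4, tau=0.5, rho=0.5, tol=1e-8, max_iter=50):
    mu = 1.0
    for _ in range(max_iter):
        g, a, ck = grad_f(x), grad_c(x), c(x)
        # Convergence test on the KKT residual and the constraint violation.
        kkt = max(abs(gi - lam * ai) for gi, ai in zip(g, a))
        if kkt <= tol and abs(ck) <= tol:
            break
        # Hessian of the Lagrangian is -2*lam*I for this problem; floor keeps it positive.
        h = max(-2.0 * lam, 1e-3)
        # Closed-form solution of the QP: h p - lam_hat a = -g,  a^T p = -c.
        lam_hat = (dot(a, g) - h * ck) / dot(a, a)
        p = [(lam_hat * ai - gi) / h for ai, gi in zip(a, g)]
        p_lam = lam_hat - lam
        # Penalty parameter from (18.36) with sigma = 1, raised with a 10% margin.
        g_p, p_h_p = dot(g, p), h * dot(p, p)
        if abs(ck) > 0.0:
            bound = (g_p + 0.5 * p_h_p) / ((1.0 - rho) * abs(ck))
            if mu < bound:
                mu = 1.1 * bound
        d_phi = g_p - mu * abs(ck)  # directional derivative (18.29)
        # Backtracking on the l1 merit function, condition (18.28).
        alpha, phi0 = 1.0, l1_merit(x, mu)
        while (l1_merit([xi + alpha * pi for xi, pi in zip(x, p)], mu)
               > phi0 + eta * alpha * d_phi) and alpha > 1e-10:
            alpha *= tau
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        lam += alpha * p_lam
    return x, lam

x_star, lam_star = line_search_sqp([-1.5, -1.5], -0.5)
```

From this starting point the iterates approach the minimizer $(-1, -1)$ with multiplier $-1/2$. The lower limit on `alpha` is only a guard against an endless backtracking loop and is not part of the algorithm as stated.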

We  can  achieve  significant  savings  in  the  solution  of  the  quadratic  subproblem 
by  warm-start  procedures.  For  example,  we  can  initialize  the  working  set  for  each  QP 
subproblem  to  be  the  final  active  set  from  the  previous  SQP  iteration. 

We  have  not  given  particulars  of  the  quasi-Newton  approximation  in  Algorithm  18.3. 
We  could  use,  for  example,  a  limited-memory  BFGS  approach  that  is  suitable  for  large-scale 
problems. If we use an exact Hessian $\nabla^2_{xx}\mathcal{L}_k$, we assume that it is modified as necessary to
be positive definite on the null space of the equality constraints.

Instead of a merit function, we could employ a filter (see Section 15.4) in the inner
"while" loop to determine the steplength $\alpha_k$. As discussed in Section 15.4, a feasibility
restoration phase is invoked if a trial steplength generated by the backtracking line search is
smaller than a given threshold. Regardless of whether a merit function or a filter is used, a
mechanism such as second-order correction can be incorporated to overcome the Maratos
effect.


18.5 TRUST-REGION SQP METHODS


Trust-region  SQP  methods  have  several  attractive  properties.  Among  them  are  the  facts  that 
they do not require the Hessian matrix $\nabla^2_{xx}\mathcal{L}_k$ in (18.11) to be positive definite, they control
the  quality  of  the  steps  even  in  the  presence  of  Hessian  and  Jacobian  singularities,  and  they 
provide  a  mechanism  for  enforcing  global  convergence.  Some  implementations  follow  an 
IQP  approach  and  solve  an  inequality-constrained  subproblem,  while  others  follow  an  EQP 
approach. 

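The simplest way to formulate a trust-region SQP method is to add a trust-region
constraint to subproblem (18.11), as follows:

$$ \min_p \quad f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p \tag{18.43a} $$

$$ \text{subject to} \quad \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i \in \mathcal{E}, \tag{18.43b} $$

$$ \nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i \in \mathcal{I}, \tag{18.43c} $$

$$ \|p\| \le \Delta_k. \tag{18.43d} $$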

Even if the constraints (18.43b), (18.43c) are compatible, this problem may not always have a
solution because of the trust-region constraint (18.43d). We illustrate this fact in Figure 18.1
for a problem that contains only one equality constraint whose linearization is represented
by the solid line. In this example, any step $p$ that satisfies the linearized constraint must lie
outside the trust region, which is indicated by the circle of radius $\Delta_k$. As we see from this
example, a consistent system of equalities and inequalities may not have a solution if we
restrict the norm of the solution.

To resolve the possible conflict between the linear constraints (18.43b), (18.43c) and
the trust-region constraint (18.43d), it is not appropriate simply to increase $\Delta_k$ until the set
of steps $p$ satisfying the linear constraints intersects the trust region. This approach would
defeat the purpose of using the trust region in the first place as a way to define a region
within which we trust the model (18.43a)-(18.43c) to accurately reflect the behavior of the
objective and constraint functions. Analytically, it would harm the convergence properties
of the algorithm.

A more appropriate viewpoint is that there is no reason to satisfy the linearized
constraints exactly at every step; rather, we should aim to improve the feasibility of these
constraints at each step and to satisfy them exactly only if the trust-region constraint permits
it. This point of view is the basis of the three classes of methods discussed in this section:
relaxation methods, penalty methods, and filter methods.

A RELAXATION METHOD FOR EQUALITY-CONSTRAINED OPTIMIZATION

We  describe  this  method  in  the  context  of  the  equality-constrained  optimization 
problem  (18.1);  its  extension  to  general  nonlinear  programs  is  deferred  to  Chapter  19 
because  it  makes  use  of  interior-point  techniques.  (Active-set  extensions  of  the  relaxation 
approach  have  been  proposed,  but  have  not  been  fully  explored.) 

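At the iterate $x_k$, we compute the SQP step by solving the subproblem

$$ \min_p \quad f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p \tag{18.44a} $$

$$ \text{subject to} \quad A_k p + c_k = r_k, \tag{18.44b} $$

$$ \|p\|_2 \le \Delta_k. \tag{18.44c} $$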

The choice of the relaxation vector $r_k$ requires careful consideration, as it impacts the
efficiency of the method. Our goal is to choose $r_k$ as the smallest vector such that (18.44b),
(18.44c) are consistent for some reduced value of trust-region radius $\Delta_k$. To do so, we first
solve the subproblem

$$ \min_v \quad \|A_k v + c_k\|_2^2 \tag{18.45a} $$

$$ \text{subject to} \quad \|v\|_2 \le 0.8 \Delta_k. \tag{18.45b} $$

Denoting the solution of this subproblem by $v_k$, we define

$$ r_k = A_k v_k + c_k. \tag{18.46} $$

We now compute the step $p_k$ by solving (18.44), define the new iterate $x_{k+1} = x_k + p_k$,
and obtain new multiplier estimates $\hat{\lambda}_{k+1}$ using the least-squares formula (18.21). Note that
the constraints (18.44b), (18.44c) are consistent because they are satisfied by the vector
$p = v_k$.

At first glance, this approach appears to be impractical because problems (18.44)
and (18.45) are not particularly easy to solve, especially when $\nabla^2_{xx}\mathcal{L}_k$ is indefinite.
Fortunately, we can design efficient procedures for computing useful inexact solutions of these
problems.

We solve the auxiliary subproblem (18.45) by the dogleg method described in Chapter 4.
This method requires a Cauchy step $p^U$, which is the minimizer of the objective
(18.45a) along the direction $-A_k^T c_k$, and a “Newton step” $p^B$, which is the unconstrained
minimizer of (18.45a). Since the Hessian in (18.45a) is singular, there are infinitely many
possible choices of $p^B$, all of which satisfy $A_k p^B + c_k = 0$. We choose the one with smallest
Euclidean norm by setting

$$ p^B = -A_k^T [A_k A_k^T]^{-1} c_k. $$

We now take $v_k$ to be the minimizer of (18.45a) along the path defined by $p^U$, $p^B$, and the
formula (4.16).
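A minimal sketch of this dogleg computation, assuming a Jacobian with exactly two rows so that $A_k A_k^T$ can be inverted by Cramer's rule, and assuming formula (4.16) is the standard two-segment dogleg path. The function names, the test matrix, and the radii are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def rmatvec(A, w):
    """Compute A^T w for A stored as a list of rows."""
    return [sum(A[i][j] * w[i] for i in range(len(A))) for j in range(len(A[0]))]

def normal_step(A, c, delta, shrink=0.8):
    """Dogleg solution of min ||A v + c||^2 s.t. ||v|| <= shrink*delta (A has 2 rows)."""
    radius = shrink * delta
    # Minimum-norm Newton step p_B = -A^T (A A^T)^{-1} c, 2x2 solve by Cramer's rule.
    g11, g12, g22 = dot(A[0], A[0]), dot(A[0], A[1]), dot(A[1], A[1])
    det = g11 * g22 - g12 * g12
    w = [(g22 * c[0] - g12 * c[1]) / det, (g11 * c[1] - g12 * c[0]) / det]
    p_b = [-t for t in rmatvec(A, w)]
    if norm(p_b) <= radius:
        return p_b
    # Cauchy step: minimizer of ||A v + c||^2 along -A^T c.
    g = rmatvec(A, c)
    Ag = matvec(A, g)
    step = dot(g, g) / dot(Ag, Ag)
    p_u = [-step * gi for gi in g]
    if norm(p_u) >= radius:
        return [radius * pi / norm(p_u) for pi in p_u]
    # Second dogleg segment: find tau in [0, 1] with ||p_u + tau (p_b - p_u)|| = radius.
    d = [b - u for b, u in zip(p_b, p_u)]
    qa, qb, qc = dot(d, d), 2.0 * dot(p_u, d), dot(p_u, p_u) - radius ** 2
    tau = (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)
    return [u + tau * di for u, di in zip(p_u, d)]

def relaxation(A, c, v):
    """Relaxation vector r = A v + c from (18.46)."""
    return [ri + ci for ri, ci in zip(matvec(A, v), c)]

A_k, c_k = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [1.0, 1.0]
v_large = normal_step(A_k, c_k, delta=2.0)   # Newton step fits inside the region
v_mid = normal_step(A_k, c_k, delta=1.0)     # lands on the second dogleg segment
v_small = normal_step(A_k, c_k, delta=0.5)   # truncated Cauchy step
```

When the radius is large enough the relaxation vector is zero and (18.44b) reduces to the original linearized constraint; otherwise the normal step sits on the boundary of the shrunken region and only reduces the linearized infeasibility.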

The  preferred  technique  for  computing  an  approximate  solution  pk  of  (18.44)  is  the 
projected  conjugate  gradient  method  of  Algorithm  16.2.  We  apply  this  algorithm  to  the 
equality-constrained quadratic program (18.44a)-(18.44b), monitoring satisfaction of the
trust-region  constraint  (18.44c)  and  stopping  if  the  boundary  of  this  region  is  reached  or 
if  negative  curvature  is  detected;  see  Section  7.1.  Algorithm  16.2  requires  a  feasible  starting 
point,  which  may  be  chosen  as  vk. 

A merit function that fits well with this approach is the nonsmooth $\ell_2$ function
$\phi_2(x; \mu) = f(x) + \mu \|c(x)\|_2$. We model it by means of the function

$$ q_\mu(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu\, m(p), \tag{18.47} $$

where

$$ m(p) = \|c_k + A_k p\|_2 $$

(see (18.34)). We choose the penalty parameter large enough that inequality (18.35) is
satisfied. To judge the acceptability of a step $p_k$, we monitor the ratio

$$ \rho_k = \frac{\text{ared}_k}{\text{pred}_k} = \frac{\phi_2(x_k, \mu) - \phi_2(x_k + p_k, \mu)}{q_\mu(0) - q_\mu(p_k)}. \tag{18.48} $$

We can now give a description of this trust-region SQP method for the equality-
constrained optimization problem (18.1).

Algorithm 18.4 (Byrd-Omojokun Trust-Region SQP Method).

  Choose constants $\epsilon > 0$ and $\eta, \gamma \in (0, 1)$;
  Choose starting point $x_0$, initial trust region $\Delta_0 > 0$;
  for $k = 0, 1, 2, \ldots$
    Compute $f_k$, $c_k$, $\nabla f_k$, $A_k$;
    Compute multiplier estimates $\hat{\lambda}_k$ by (18.21);
    if $\|\nabla f_k - A_k^T \hat{\lambda}_k\|_\infty < \epsilon$ and $\|c_k\|_\infty < \epsilon$
      stop with approximate solution $x_k$;
    Solve normal subproblem (18.45) for $v_k$ and compute $r_k$ from (18.46);
    Compute $\nabla^2_{xx}\mathcal{L}_k$ or a quasi-Newton approximation;
    Compute $p_k$ by applying the projected CG method to (18.44);
    Choose $\mu_k$ to satisfy (18.35);
    Compute $\rho_k = \text{ared}_k / \text{pred}_k$;
    if $\rho_k > \eta$
      Set $x_{k+1} = x_k + p_k$;
      Choose $\Delta_{k+1}$ to satisfy $\Delta_{k+1} \ge \Delta_k$;
    else
      Set $x_{k+1} = x_k$;
      Choose $\Delta_{k+1}$ to satisfy $\Delta_{k+1} \le \gamma \|p_k\|$;
  end (for).


A  second-order  correction  can  be  added  to  avoid  the  Maratos  effect.  Beyond  the  cost 
of  evaluating  the  objective  function  f  and  constraints  c,  the  main  costs  of  this  algorithm 
lie in the projected CG iteration, which requires products of the Hessian $\nabla^2_{xx}\mathcal{L}_k$ with
vectors,  and  in  the  factorization  and  backsolves  with  the  projection  matrix  (16.32);  see 
Section  16.3. 
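The acceptance test built on the ratio (18.48) can be sketched as follows. The data are assumptions chosen for illustration: the toy problem $f = x_1 + x_2$, $c = x_1^2 + x_2^2 - 2$ at $x_k = (-1.5, -0.5)$ with the identity as Hessian, a step that satisfies the linearized constraint exactly, and the simplest radius rules allowed by Algorithm 18.4 (keep $\Delta$ on acceptance, set it to $\gamma \|p_k\|$ on rejection). The values of $\eta$ and $\gamma$ and all function names are likewise illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_merit(f_val, c_val, mu):
    """phi_2 = f + mu * ||c||_2."""
    return f_val + mu * math.sqrt(dot(c_val, c_val))

def model_q(f_k, g, H, c_k, A, p, mu):
    """Model (18.47): quadratic in p plus mu * ||c_k + A p||_2."""
    Hp = [dot(row, p) for row in H]
    lin = [ci + dot(row, p) for ci, row in zip(c_k, A)]
    return f_k + dot(g, p) + 0.5 * dot(p, Hp) + mu * math.sqrt(dot(lin, lin))

def accept_and_update(ared, pred, delta, p_norm, eta=0.1, gamma=0.5):
    """Return (accepted, next_radius) following the ratio test of Algorithm 18.4."""
    rho = ared / pred
    if rho > eta:
        return True, delta          # any radius >= delta is permitted
    return False, gamma * p_norm    # any radius <= gamma * ||p|| is permitted

# Assumed data at x_k = (-1.5, -0.5).
f_k, g, H = -2.0, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]
c_k, A, mu = [0.5], [[-3.0, -1.0]], 1.0
p = [0.35, -0.55]

pred = model_q(f_k, g, H, c_k, A, [0.0, 0.0], mu) - model_q(f_k, g, H, c_k, A, p, mu)
x_new = [-1.5 + p[0], -0.5 + p[1]]
f_new = x_new[0] + x_new[1]
c_new = [x_new[0] ** 2 + x_new[1] ** 2 - 2.0]
ared = l2_merit(f_k, c_k, mu) - l2_merit(f_new, c_new, mu)

accepted, delta_next = accept_and_update(ared, pred, 1.0, math.sqrt(dot(p, p)))
rejected = accept_and_update(-0.1, 1.0, 1.0, 0.8)
```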


S$\ell_1$QP (SEQUENTIAL $\ell_1$ QUADRATIC PROGRAMMING)

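In this approach we move the linearized constraints (18.43b), (18.43c) into the objective
of the quadratic program, in the form of an $\ell_1$ penalty term, to obtain the following
subproblem:

$$ \begin{aligned} \min_p \quad & q_\mu(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu \sum_{i \in \mathcal{E}} |c_i(x_k) + \nabla c_i(x_k)^T p| + \mu \sum_{i \in \mathcal{I}} [c_i(x_k) + \nabla c_i(x_k)^T p]^- \\ \text{subject to} \quad & \|p\|_\infty \le \Delta_k, \end{aligned} \tag{18.49} $$

for some penalty parameter $\mu$, where we use the notation $[y]^- = \max\{0, -y\}$. Introducing
slack variables $v$, $w$, $t$, we can reformulate this problem as follows:

$$ \min_{p, v, w, t} \quad f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu \sum_{i \in \mathcal{E}} (v_i + w_i) + \mu \sum_{i \in \mathcal{I}} t_i \tag{18.50a} $$

$$ \text{s.t.} \quad \nabla c_i(x_k)^T p + c_i(x_k) = v_i - w_i, \quad i \in \mathcal{E}, \tag{18.50b} $$

$$ \nabla c_i(x_k)^T p + c_i(x_k) \ge -t_i, \quad i \in \mathcal{I}, \tag{18.50c} $$

$$ v, w, t \ge 0, \tag{18.50d} $$

$$ \|p\|_\infty \le \Delta_k. \tag{18.50e} $$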


This  formulation  is  simply  a  linearization  of  the  elastic-mode  formulation  (18.12)  with  the 
addition  of  a  trust-region  constraint. 

The  constraints  of  this  problem  are  always  consistent.  Since  the  trust  region  has  been 
defined using the $\ell_\infty$ norm, (18.50) is a smooth quadratic program that can be solved
by  means  of  a  quadratic  programming  algorithm.  Warm-start  strategies  can  significantly 
reduce  the  solution  time  of  (18.50)  and  are  invariably  used  in  practical  implementations. 

It is natural to use the $\ell_1$ merit function

$$ \phi_1(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E}} |c_i(x)| + \mu \sum_{i \in \mathcal{I}} [c_i(x)]^- \tag{18.51} $$

to determine step acceptance. In fact, the function $q_\mu$ defined in (18.49) can be viewed
as a model of $\phi_1(x, \mu)$ at $x_k$ in which we approximate each constraint function $c_i$ by
its linearization, and replace $f$ by a quadratic function whose curvature term includes
information from both objective and constraints.
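The relationship between the merit function (18.51) and the model $q_\mu$ of (18.49) is easy to see in code. The sketch below uses made-up data (one equality and one inequality constraint in two variables); all names and numbers are illustrative assumptions, and constraint Jacobians are stored as lists of rows.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_part(y):
    """[y]^- = max{0, -y}."""
    return max(0.0, -y)

def l1_merit(f_val, c_eq, c_ineq, mu):
    """Merit function (18.51) from function and constraint values at a point."""
    return (f_val + mu * sum(abs(v) for v in c_eq)
            + mu * sum(neg_part(v) for v in c_ineq))

def l1_model(f_k, g, H, c_eq, J_eq, c_ineq, J_ineq, p, mu):
    """Model q_mu(p) of (18.49): quadratic objective plus l1 penalty on linearizations."""
    Hp = [dot(row, p) for row in H]
    lin_eq = [ci + dot(row, p) for ci, row in zip(c_eq, J_eq)]
    lin_ineq = [ci + dot(row, p) for ci, row in zip(c_ineq, J_ineq)]
    return (f_k + dot(g, p) + 0.5 * dot(p, Hp)
            + mu * sum(abs(v) for v in lin_eq)
            + mu * sum(neg_part(v) for v in lin_ineq))

# Illustrative data: equality residual 0.5, inequality violated by 0.2.
f_k, g, H = 1.0, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
c_eq, J_eq = [0.5], [[1.0, 1.0]]
c_ineq, J_ineq = [-0.2], [[0.0, 1.0]]
mu = 2.0

q_zero = l1_model(f_k, g, H, c_eq, J_eq, c_ineq, J_ineq, [0.0, 0.0], mu)
q_step = l1_model(f_k, g, H, c_eq, J_eq, c_ineq, J_ineq, [-0.5, 0.2], mu)
phi_k = l1_merit(f_k, c_eq, c_ineq, mu)
```

At $p = 0$ the model coincides with the merit function value at $x_k$, so the difference `q_zero - q_step` plays the role of the predicted reduction in the ratio (18.48).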

After computing the step $p_k$ from (18.50), we determine the ratio $\rho_k$ via (18.48),
using the merit function $\phi_1$ and defining $q_\mu$ by (18.49). The step is accepted or rejected
according to standard trust-region rules, as implemented in Algorithm 18.4. A second-order
correction step can be added to prevent the occurrence of the Maratos effect.

The S$\ell_1$QP approach has several attractive properties. Not only does the formulation
(18.49) overcome the possible inconsistency among the linearized constraints, but
it also ensures that the trust-region constraint can always be satisfied. Further, the matrix
$\nabla^2_{xx}\mathcal{L}_k$ can be used without modification in subproblem (18.50) or else can be
replaced by a quasi-Newton approximation. There is no requirement for it to be positive
definite.

The choice of the penalty parameter $\mu$ plays an important role in the efficiency
of this method. Unlike the SQP methods described above, which use a penalty function
only to determine the acceptability of a trial point, the step $p_k$ of the S$\ell_1$QP algorithm
depends on $\mu$. Values of $\mu$ that are too small can lead the algorithm away from the solution
(Section 17.2), while excessively large values can result in slow progress. To obtain good
practical performance over a range of applications, the value of $\mu$ must be chosen carefully
at each iteration; see Algorithm 18.5 below.

SEQUENTIAL LINEAR-QUADRATIC PROGRAMMING (SLQP)

The  SQP  methods  discussed  above  require  the  solution  of  a  general  (inequality- 
constrained)  quadratic  problem  at  each  iteration.  The  cost  of  solving  this  subproblem 
imposes  a  limit  on  the  size  of  problems  that  can  be  solved  in  practice.  In  addition,  the 
incorporation  of  (indefinite)  second  derivative  information  in  SQP  methods  has  proved  to 
be  difficult  [147]. 

The  sequential  linear-quadratic  programming  (SLQP)  method  attempts  to  overcome 
these  concerns  by  computing  the  step  in  two  stages,  each  of  which  scales  well  with  the 
number  of  variables.  First,  a  linear  program  (LP)  is  solved  to  identify  a  working  set  W. 
Second,  there  is  an  equality-constrained  quadratic  programming  (EQP)  phase  in  which  the 
constraints  in  the  working  set  W  are  imposed  as  equalities.  The  total  step  of  the  algorithm 
is  a  combination  of  the  steps  obtained  in  the  linear  programming  and  equality-constrained 
phases,  as  we  now  discuss. 

In the LP phase, we would like to solve the problem

$$ \min_p \quad f_k + \nabla f_k^T p \tag{18.52a} $$

$$ \text{subject to} \quad c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{E}, \tag{18.52b} $$

$$ c_i(x_k) + \nabla c_i(x_k)^T p \ge 0, \quad i \in \mathcal{I}, \tag{18.52c} $$

$$ \|p\|_\infty \le \Delta_k^{LP}, \tag{18.52d} $$

which differs from the standard SQP subproblem (18.43) only in that the second-order
term in the objective has been omitted and that an $\ell_\infty$ norm is used to define the trust
region. Since the constraints of (18.52) may be inconsistent, we solve instead the $\ell_1$ penalty
reformulation of (18.52) defined by

$$ \min_p \quad l_\mu(p) = f_k + \nabla f_k^T p + \mu \sum_{i \in \mathcal{E}} |c_i(x_k) + \nabla c_i(x_k)^T p| + \mu \sum_{i \in \mathcal{I}} [c_i(x_k) + \nabla c_i(x_k)^T p]^- \tag{18.53a} $$

$$ \text{subject to} \quad \|p\|_\infty \le \Delta_k^{LP}. \tag{18.53b} $$

By introducing slack variables as in (18.50), we can reformulate (18.53) as an LP. The solution
of (18.53), which we denote by $p^{LP}$, is computed by the simplex method (Chapter 13). From
this solution we obtain the following explicit estimate of the optimal active set:

$$ \mathcal{A}_k(p^{LP}) = \{ i \in \mathcal{E} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{LP} = 0 \} \cup \{ i \in \mathcal{I} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{LP} = 0 \}. $$

Likewise, we define the set $\mathcal{V}_k$ of violated constraints as

$$ \mathcal{V}_k(p^{LP}) = \{ i \in \mathcal{E} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{LP} \ne 0 \} \cup \{ i \in \mathcal{I} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{LP} < 0 \}. $$

We define the working set $\mathcal{W}_k$ as some linearly independent subset of the active set $\mathcal{A}_k(p^{LP})$.
To ensure that the algorithm makes progress on the penalty function $\phi_1$, we define the
Cauchy step,

$$ p^C = \alpha^{LP} p^{LP}, \tag{18.54} $$

where $\alpha^{LP} \in (0, 1]$ is a steplength that provides sufficient decrease in the model $q_\mu$ defined
in (18.49).
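The active and violated sets can be read off from the linearized constraint values at $p^{LP}$. The sketch below takes those values as given, keyed by constraint index; the tolerance `tol` replaces the exact comparisons with zero in the definitions and is an assumption needed for floating-point data, as are the function name and the sample values. The returned signs correspond to the algebraic signs of the violated constraints.

```python
def classify_constraints(lin_eq, lin_ineq, tol=1e-8):
    """Return (active, violated, signs) from linearized constraint values at p_LP.

    lin_eq and lin_ineq map a constraint index i to c_i(x_k) + grad c_i(x_k)^T p_LP
    for equality and inequality constraints respectively.
    """
    active = {i for i, v in lin_eq.items() if abs(v) <= tol}
    active |= {i for i, v in lin_ineq.items() if abs(v) <= tol}
    violated = {i for i, v in lin_eq.items() if abs(v) > tol}
    violated |= {i for i, v in lin_ineq.items() if v < -tol}
    values = {**lin_eq, **lin_ineq}
    signs = {i: (1.0 if values[i] > 0.0 else -1.0) for i in violated}
    return active, violated, signs


# Illustrative values: equalities 1 and 2, inequalities 3, 4 and 5.
active_set, violated_set, gamma_signs = classify_constraints(
    {1: 0.0, 2: 0.3}, {3: 0.0, 4: 1.2, 5: -0.4})
```

Inequality 4 is satisfied strictly, so it belongs to neither set. Selecting a linearly independent subset of the active set as the working set is a separate step that is not shown here.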

Given the working set $\mathcal{W}_k$, we now solve an equality-constrained quadratic program
(EQP) treating the constraints in $\mathcal{W}_k$ as equalities and ignoring all others. We thus obtain
the subproblem

$$ \min_p \quad f_k + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \Big( \nabla f_k + \mu_k \sum_{i \in \mathcal{V}_k} \gamma_i \nabla c_i(x_k) \Big)^T p \tag{18.55a} $$

$$ \text{subject to} \quad c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{E} \cap \mathcal{W}_k, \tag{18.55b} $$

$$ c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{I} \cap \mathcal{W}_k, \tag{18.55c} $$

$$ \|p\|_2 \le \Delta_k, \tag{18.55d} $$

where $\gamma_i$ is the algebraic sign of the $i$-th violated constraint. Note that the trust region
(18.55d) is spherical, and that $\Delta_k$ is distinct from the trust-region radius $\Delta_k^{LP}$ used in
(18.53b). Problem (18.55) is solved for the vector $p^Q$ by applying the projected conjugate
gradient procedure of Algorithm 16.2, handling the trust-region constraint by Steihaug's
strategy (Algorithm 7.2). The total step $p_k$ of the SLQP method is given by

$$ p_k = p^C + \alpha^Q (p^Q - p^C), $$

where $\alpha^Q \in [0, 1]$ is a steplength that approximately minimizes the model $q_\mu$ defined in
(18.49).

The trust-region radius $\Delta_k$ for the EQP phase is updated using standard trust-region
update strategies. The choice of radius $\Delta_{k+1}^{LP}$ for the LP phase is more delicate, since it
influences our guess of the optimal active set. The value of $\Delta_{k+1}^{LP}$ should be set to be a little
larger than the length of the total step $p_k$, subject to some other restrictions [49]. The multiplier estimates
$\lambda_k$ used in the Hessian $\nabla^2_{xx}\mathcal{L}_k$ are least-squares estimates (18.21) using the working set $\mathcal{W}_k$,
and modified so that $\lambda_i \ge 0$ for $i \in \mathcal{I}$.

An  appealing  feature  of  the  SLQP  algorithm  is  that  established  techniques  for  solving 
large-scale versions of the LP and EQP subproblems are readily available. High-quality LP
software is capable of solving problems with very large numbers of variables and constraints,
while  the  solution  of  the  EQP  subproblem  can  be  performed  efficiently  using  the  projected 
conjugate  gradient  method. 

A  TECHNIQUE  FOR  UPDATING  THE  PENALTY  PARAMETER 

We have mentioned that penalty methods such as S$\ell_1$QP and SLQP can be sensitive to the choice of the penalty parameter $\mu$. We now discuss a procedure for choosing $\mu$ that has proved to be effective in practice and is supported by global convergence guarantees. The goal is to choose $\mu$ small enough to avoid an unnecessary imbalance in the merit function, but large enough to cause the step to make sufficient progress in linearized feasibility at each iteration. We present this procedure in the context of the S$\ell_1$QP method and then describe its extension to the SLQP approach.

We define a piecewise linear model of constraint violation at a point $x_k$ by

$$ m_k(p) = \sum_{i \in \mathcal{E}} \left| c_i(x_k) + \nabla c_i(x_k)^T p \right| + \sum_{i \in \mathcal{I}} \left[ c_i(x_k) + \nabla c_i(x_k)^T p \right]^-, \tag{18.56} $$

so that the objective of the SQP subproblem (18.49) can be written as

$$ q_\mu(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla_{xx}^2 \mathcal{L}_k \, p + \mu \, m_k(p). \tag{18.57} $$

We begin by solving the QP subproblem (18.49) (or equivalently, (18.50)) using the previous value $\mu_{k-1}$ of the penalty parameter. If the constraints (18.50b), (18.50c) are satisfied with the slack variables $v_i$, $w_i$, $t_i$ all equal to zero (that is, $m_k(p_k) = 0$), then the current value of $\mu$ is adequate, and we set $\mu_k = \mu_{k-1}$. This is the felicitous case in which we can achieve linearized feasibility with a step $p_k$ that is no longer in norm than the trust-region radius.
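
As a small illustration of (18.56) and (18.57), the following Python sketch evaluates $m_k(p)$ and $q_\mu(p)$ for dense data stored in plain lists. It assumes the usual convention $[y]^- = \max\{0, -y\}$; the function names and the toy numbers are hypothetical and chosen only for this example.

```python
def constraint_violation_model(p, c_eq, J_eq, c_in, J_in):
    """Piecewise linear model m_k(p) of (18.56).

    c_eq, c_in hold constraint values at x_k; J_eq, J_in hold the
    corresponding gradients, one list per constraint.
    """
    def linearized(c, grad):
        return c + sum(g * pj for g, pj in zip(grad, p))

    eq_part = sum(abs(linearized(c, g)) for c, g in zip(c_eq, J_eq))
    # [y]^- = max(0, -y): only negative linearized inequalities contribute.
    in_part = sum(max(0.0, -linearized(c, g)) for c, g in zip(c_in, J_in))
    return eq_part + in_part


def penalty_model(p, mu, f, grad_f, hess, c_eq, J_eq, c_in, J_in):
    """Model q_mu(p) of (18.57): quadratic part plus mu * m_k(p)."""
    n = len(p)
    linear = sum(grad_f[i] * p[i] for i in range(n))
    quad = 0.5 * sum(p[i] * hess[i][j] * p[j]
                     for i in range(n) for j in range(n))
    return f + linear + quad + mu * constraint_violation_model(
        p, c_eq, J_eq, c_in, J_in)


# Toy data (hypothetical): one equality and one inequality in two variables.
c_eq, J_eq = [1.0], [[1.0, 1.0]]      # c_1 = 1,    grad c_1 = (1, 1)
c_in, J_in = [-0.5], [[1.0, 0.0]]     # c_2 = -0.5, grad c_2 = (1, 0)
identity = [[1.0, 0.0], [0.0, 1.0]]

m_at_zero = constraint_violation_model([0.0, 0.0], c_eq, J_eq, c_in, J_in)
m_at_step = constraint_violation_model([0.5, -1.5], c_eq, J_eq, c_in, J_in)
q_at_step = penalty_model([0.5, -1.5], 10.0, 2.0, [1.0, 0.0], identity,
                          c_eq, J_eq, c_in, J_in)
```

Here the step $p = (0.5, -1.5)^T$ satisfies both linearized constraints, so the penalty term vanishes and $q_\mu(p)$ reduces to the quadratic model alone, whatever the value of $\mu$.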

If $m_k(p) > 0$, on the other hand, it may be appropriate to increase the penalty parameter. The question is: by how much? To obtain a reference value, we re-solve the QP (18.49) using an “infinite” value of $\mu$, by which we mean that the objective function in (18.49) is replaced by $m_k(p)$. After computing the new step, which we denote by $p_\infty$, two outcomes are possible. If $m_k(p_\infty) = 0$, meaning that the linearized constraints are feasible within the trust region, we choose $\mu_k > \mu_{k-1}$ such that $m_k(p_k) = 0$. Otherwise, if $m_k(p_\infty) > 0$, we choose $\mu_k \ge \mu_{k-1}$ such that the reduction in $m_k$ caused by the step $p_k$ is at least a fraction of the (optimal) reduction given by $p_\infty$.

The selection of $\mu_k > \mu_{k-1}$ is achieved in all cases by successively increasing the current trial value of $\mu$ (by a factor of 10, say) and re-solving the quadratic program (18.49). To describe this strategy more precisely, we write the solution of the QP problem (18.49) as $p(\mu)$ to stress its dependence on the penalty parameter. Likewise, $p_\infty$ denotes the minimizer of $m_k(p)$ subject to the trust-region constraint (18.50e). The following algorithm describes the selection of the penalty parameter $\mu_k$ and the computation of the S$\ell_1$QP step $p_k$.

Algorithm 18.5 (Penalty Update and Step Computation).

Initial data: $x_k$, $\mu_{k-1} > 0$, $\Delta_k > 0$, and parameters $\epsilon_1, \epsilon_2 \in (0, 1)$.

Solve the subproblem (18.50) with $\mu = \mu_{k-1}$ to obtain $p(\mu_{k-1})$;

if $m_k(p(\mu_{k-1})) = 0$
    Set $\mu^+ \leftarrow \mu_{k-1}$;
else
    Compute $p_\infty$;
    if $m_k(p_\infty) = 0$
        Find $\mu^+ > \mu_{k-1}$ such that $m_k(p(\mu^+)) = 0$;
    else
        Find $\mu^+ \ge \mu_{k-1}$ such that
            $m_k(0) - m_k(p(\mu^+)) \ge \epsilon_1 [m_k(0) - m_k(p_\infty)]$;
    end (if)
end (if)

Increase $\mu^+$ if necessary to satisfy
    $q_{\mu^+}(0) - q_{\mu^+}(p(\mu^+)) \ge \epsilon_2 \mu^+ [m_k(0) - m_k(p(\mu^+))]$;
Set $\mu_k \leftarrow \mu^+$ and $p_k \leftarrow p(\mu^+)$.


(Note  that  the  inequality  in  the  penultimate  line  is  the  same  as  condition  (18.35).) 
Although  Algorithm  18.5  requires  the  solution  of  some  additional  quadratic  programs, 
we  hope  to  reduce  the  total  number  of  iterations  (and  the  total  number  of  QP  solves) 
by  identifying  an  appropriate  penalty  parameter  value  more  quickly  than  rules  based  on 
feasibility  monitoring  (see  Framework  17.2). 
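
The control flow of Algorithm 18.5 can be sketched in Python as follows. This is a minimal illustration rather than a production implementation: the subproblem solvers are passed in as callables, the trial value of the penalty parameter is multiplied by a fixed factor of 10, and the cap `mu_max` and tolerance `tol` are safeguards added here. The one-dimensional test problem built by `make_toy` is hypothetical; it is small enough that the subproblem can be solved exactly by comparing a few candidate points.

```python
def update_penalty(solve_qp, solve_feas, m, q, p_zero, mu_prev,
                   eps1=0.1, eps2=0.1, factor=10.0, mu_max=1e12, tol=1e-10):
    """Sketch of Algorithm 18.5; returns (mu_k, p_k)."""
    mu = mu_prev
    p = solve_qp(mu)                       # p(mu_{k-1})
    if m(p) > tol:
        p_inf = solve_feas()               # minimizer of m_k in the trust region
        if m(p_inf) <= tol:
            # Linearized feasibility is attainable: demand it from p(mu).
            def ok(step):
                return m(step) <= tol
        else:
            # Otherwise demand a fraction eps1 of the best possible reduction.
            target = eps1 * (m(p_zero) - m(p_inf))

            def ok(step):
                return m(p_zero) - m(step) >= target
        while not ok(p) and mu < mu_max:
            mu *= factor
            p = solve_qp(mu)
    # Final safeguard on the decrease in the model q_mu.
    while (q(p_zero, mu) - q(p, mu) < eps2 * mu * (m(p_zero) - m(p))
           and mu < mu_max):
        mu *= factor
        p = solve_qp(mu)
    return mu, p


def make_toy(g=1.0, h=1.0, c=-1.0, a=1.0, delta=2.0):
    """Hypothetical 1-D data: q_mu(p) = g p + h p^2 / 2 + mu |c + a p|,
    subject to |p| <= delta, with h > 0."""
    def m(p):
        return abs(c + a * p)

    def q(p, mu):
        return g * p + 0.5 * h * p * p + mu * m(p)

    def inside(points):
        return [p for p in points if abs(p) <= delta]

    def solve_qp(mu):
        # Candidates: both endpoints, the kink of |c + a p|, and the
        # stationary point of each smooth piece.
        pts = [-delta, delta, -c / a, (mu * a - g) / h, -(mu * a + g) / h]
        return min(inside(pts), key=lambda p: q(p, mu))

    def solve_feas():
        return min(inside([-delta, delta, -c / a]), key=m)

    return m, q, solve_qp, solve_feas


m, q, solve_qp, solve_feas = make_toy()
mu_k, p_k = update_penalty(solve_qp, solve_feas, m, q, 0.0, mu_prev=0.5)
```

With the default data the linearized constraint $p - 1 = 0$ is feasible inside the trust region, and a single increase from $\mu = 0.5$ to $\mu = 5$ produces the feasible step $p = 1$. Shrinking the radius (for example `make_toy(delta=0.5)`) exercises the other branch, in which only a fraction of the optimal reduction in $m_k$ is required.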

Numerical experience indicates that these savings occur when an adaptation of Algorithm 18.5 is used in the SLQP method. This adaptation is obtained simply by setting $\nabla_{xx}^2 \mathcal{L}_k = 0$ in the definition (18.49) of $q_\mu$ and applying Algorithm 18.5 to determine $\mu$ and to compute the LP step $p^{LP}$. The extra LP solves required by Algorithm 18.5 in this case are typically inexpensive, requiring relatively few simplex iterations, because we can use warm-start information from LPs solved earlier, with different values of the penalty parameter.


18.6 NONLINEAR GRADIENT PROJECTION


In Section 16.7, we discussed the gradient projection method for bound-constrained quadratic programming. It is not difficult to extend this method to the problem

$$ \min f(x) \quad \text{subject to} \quad l \le x \le u, \tag{18.58} $$

where $f$ is a nonlinear function and $l$ and $u$ are vectors of lower and upper bounds, respectively.

We begin by describing a line search approach. At the current iterate $x_k$, we form the quadratic model

$$ q_k(x) = f_k + \nabla f_k^T (x - x_k) + \tfrac{1}{2} (x - x_k)^T B_k (x - x_k), \tag{18.59} $$

where $B_k$ is a positive definite approximation to $\nabla^2 f(x_k)$. We then use the gradient projection method for quadratic programming (Algorithm 16.5) to find an approximate solution $\hat{x}$ of the subproblem

$$ \min q_k(x) \quad \text{subject to} \quad l \le x \le u. \tag{18.60} $$

The search direction is defined as $p_k = \hat{x} - x_k$ and the new iterate is given by $x_{k+1} = x_k + \alpha_k p_k$, where the steplength $\alpha_k$ is chosen to satisfy

$$ f(x_k + \alpha_k p_k) \le f(x_k) + \eta \alpha_k \nabla f_k^T p_k $$

for some parameter $\eta \in (0, 1)$.

To see that the search direction $p_k$ is indeed a descent direction for the objective function, we use the properties of Algorithm 16.5, as discussed in Section 16.7. Recall that this method searches along a piecewise linear path (the projected steepest descent path) for the Cauchy point $x^c$, which minimizes $q_k$ along this path. It then identifies the components of $x$ that are at their bounds and holds these components constant while performing an unconstrained minimization of $q_k$ over the remaining components to obtain the approximate solution $\hat{x}$ of the subproblem (18.60).

The Cauchy point $x^c$ satisfies $q_k(x^c) < q_k(x_k)$ if the projected gradient is nonzero. Since Algorithm 16.5 produces a subproblem solution $\hat{x}$ with $q_k(\hat{x}) \le q_k(x^c)$, we have

$$ f_k = q_k(x_k) > q_k(x^c) \ge q_k(\hat{x}) = f_k + \nabla f_k^T p_k + \tfrac{1}{2} p_k^T B_k p_k. $$

Rearranging gives $\nabla f_k^T p_k < -\tfrac{1}{2} p_k^T B_k p_k \le 0$. This inequality implies that $\nabla f_k^T p_k < 0$, since $B_k$ is assumed to be positive definite.
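
One iteration of the line search approach can be sketched in Python under a simplifying assumption: taking $B_k = I$, the model $q_k$ is separable and its exact minimizer over the box is the projection of $x_k - \nabla f_k$ onto the bounds, so Algorithm 16.5 is not needed. The helper names and the quadratic test function below are hypothetical.

```python
def project(x, lower, upper):
    """Componentwise projection onto the box lower <= x <= upper."""
    return [min(max(xi, lo), up) for xi, lo, up in zip(x, lower, upper)]


def line_search_step(f, grad, x, lower, upper,
                     eta=1e-4, shrink=0.5, max_backtracks=50):
    """One gradient projection iteration with B_k = I (x must be feasible).

    Returns (new_x, p, alpha); alpha = 0.0 signals that no step was taken.
    """
    g = grad(x)
    # With B_k = I the box-constrained minimizer of q_k is P(x - g).
    x_hat = project([xi - gi for xi, gi in zip(x, g)], lower, upper)
    p = [xh - xi for xh, xi in zip(x_hat, x)]
    slope = sum(gi * pi for gi, pi in zip(g, p))    # grad f_k^T p_k <= 0
    if slope >= 0.0:
        return x, p, 0.0                            # p = 0: x is stationary
    alpha, fx = 1.0, f(x)
    for _ in range(max_backtracks):
        trial = [xi + alpha * pi for xi, pi in zip(x, p)]
        if f(trial) <= fx + eta * alpha * slope:    # sufficient decrease
            return trial, p, alpha
        alpha *= shrink
    return x, p, 0.0


# Hypothetical test problem: minimize (x1 - 2)^2 + (x2 + 1)^2 on [0, 1]^2.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2


def grad(x):
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)]


lower, upper = [0.0, 0.0], [1.0, 1.0]
x1, p1, a1 = line_search_step(f, grad, [0.5, 0.5], lower, upper)
x2, p2, a2 = line_search_step(f, grad, x1, lower, upper)
```

Because both $x_k$ and $\hat{x}$ lie in the box and $\alpha_k \in (0, 1]$, every trial point is feasible. In this example the first step lands on the constrained minimizer $(1, 0)^T$, and the second call returns a zero direction.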

We now consider a trust-region gradient projection method for solving (18.58). We begin by forming the quadratic model (18.59), but since there is no requirement for $q_k$ to be convex, we can define $B_k$ to be the Hessian $\nabla^2 f(x_k)$ or a quasi-Newton approximation obtained from the BFGS or SR1 formulas. The step $p_k$ is obtained by solving the subproblem

$$ \min q_k(x) \quad \text{subject to} \quad \{ l \le x \le u, \ \|x - x_k\|_\infty \le \Delta_k \}, \tag{18.61} $$

for some $\Delta_k > 0$. This problem can be posed as a bound-constrained quadratic program as follows:

$$ \min q_k(x) \quad \text{subject to} \quad \max(l, x_k - \Delta_k e) \le x \le \min(u, x_k + \Delta_k e), $$

where $e = (1, 1, \ldots, 1)^T$. Algorithm 16.5 can be used to solve this subproblem. The step $p_k$ is accepted or rejected following standard trust-region strategies, and the radius $\Delta_k$ is updated according to the agreement between the change in $f$ and the change in $q_k$ produced by the step $p_k$; see Chapter 4.
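
Forming the bounds of the equivalent bound-constrained subproblem is a componentwise operation. The short Python sketch below shows it for list-valued data; the function name and the sample numbers are hypothetical.

```python
def trust_region_bounds(x, lower, upper, delta):
    """Intersect the box [lower, upper] with the infinity-norm ball
    of radius delta centered at x."""
    lo = [max(l, xi - delta) for l, xi in zip(lower, x)]
    hi = [min(u, xi + delta) for u, xi in zip(upper, x)]
    return lo, hi


x_k = [0.5, 2.0]
lo, hi = trust_region_bounds(x_k, [0.0, 0.0], [1.0, float("inf")], 1.0)
```

In the first component the original bounds are the tighter ones, while in the second component the trust region determines both bounds, including the side on which the original upper bound is infinite.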

The  two  gradient  projection  methods  just  outlined  require  solution  of  an  inequality- 
constrained  quadratic  subproblem  at  each  iteration,  and  so  are  formally  IQP  methods.  They 
can, however, also be viewed as EQP methods because of their use of Algorithm 16.5 in solving
the  subproblem.  This  algorithm  first  identifies  a  working  set  (by  finding  the  Cauchy  point) 
and  then  solves  an  equality-constrained  subproblem  (by  fixing  the  working-set  constraints 
at their bounds). For large problems, it is efficient to perform the subspace minimization
(16.74)  by  using  the  conjugate  gradient  method.  A  preconditioner  is  sometimes  needed  to 
make  this  approach  practical;  the  most  popular  choice  is  the  incomplete  (and  modified) 
Cholesky  factorization  outlined  in  Algorithm  7.3. 

The  gradient  projection  approach  can  be  extended  in  principle  to  more  general  (linear 
or convex) constraints. Practical implementations are, however, limited to the bound-constrained
problem (18.58) because of the high cost of computing projections onto general
constraint  sets. 


18.7 CONVERGENCE ANALYSIS


Numerical  experience  has  shown  that  the  SQP  and  SLQP  methods  discussed  in  this  chapter 
often  converge  to  a  solution  from  remote  starting  points.  Hence,  there  has  been  considerable 
interest  in  understanding  what  drives  the  iterates  toward  a  solution  and  what  can  cause  the 
algorithms  to  fail.  These  global  convergence  studies  have  been  valuable  in  improving  the 
design  and  implementation  of  algorithms. 

Some early results make strong assumptions, such as boundedness of multipliers,
well-posedness of the subproblem (18.11), and regularity of constraint Jacobians. More recent
studies  relax  many  of  these  assumptions  with  the  goal  of  understanding  both  the  successful 
and  unsuccessful  outcomes  of  the  iteration.  We  now  state  a  classical  global  convergence 
result  that  gives  conditions  under  which  a  standard  SQP  algorithm  always  identifies  a  KKT 
point  of  the  nonlinear  program. 

Consider an SQP method that computes a search direction $p_k$ by solving the quadratic program (18.11). We assume that the Hessian $\nabla_{xx}^2 \mathcal{L}_k$ is replaced in (18.11a) by some symmetric and positive definite approximation $B_k$. The new iterate is defined as $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is computed by a backtracking line search, starting from the unit steplength, and terminating when

$$ \phi_1(x_k + \alpha_k p_k; \mu) \le \phi_1(x_k; \mu) - \eta \alpha_k \left( q_\mu(0) - q_\mu(p_k) \right), $$

where $\eta \in (0, 1)$, with $\phi_1$ defined as in (18.51) and $q_\mu$ defined as in (18.49). To establish the convergence result, we assume that each quadratic program (18.11) is feasible and determines a bounded solution $p_k$. We also assume that the penalty parameter $\mu$ is fixed for all $k$ and sufficiently large.
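
The following Python sketch runs this line search SQP iteration on a small equality-constrained problem, chosen here only for illustration: minimize $x_1 + x_2$ subject to $x_1^2 + x_2^2 - 2 = 0$, whose solution is $(-1, -1)^T$ with multiplier $-1/2$. It assumes $B_k = I$ and a fixed penalty parameter $\mu = 2$, which exceeds the multiplier magnitudes met along the way. With a single constraint the QP has a closed-form solution, so no QP solver is needed.

```python
import math


def f(x):
    return x[0] + x[1]


def grad_f(x):
    return [1.0, 1.0]


def c(x):
    return x[0] ** 2 + x[1] ** 2 - 2.0


def grad_c(x):
    return [2.0 * x[0], 2.0 * x[1]]


def merit(x, mu):
    """l1 merit function f(x) + mu |c(x)| for one equality constraint."""
    return f(x) + mu * abs(c(x))


def sqp_direction(x):
    """Solve min g^T p + p^T p / 2 subject to c + a^T p = 0 (B_k = I)."""
    g, a = grad_f(x), grad_c(x)
    lam = (sum(ai * gi for ai, gi in zip(a, g)) - c(x)) / sum(ai * ai for ai in a)
    return [lam * ai - gi for ai, gi in zip(a, g)], lam


def model_decrease(x, p, mu):
    """q_mu(0) - q_mu(p) for the penalty model with B_k = I."""
    lin = c(x) + sum(ai * pi for ai, pi in zip(grad_c(x), p))
    quad = sum(gi * pi for gi, pi in zip(grad_f(x), p)) + 0.5 * sum(pi * pi for pi in p)
    return mu * abs(c(x)) - (quad + mu * abs(lin))


def sqp_line_search(x, mu=2.0, eta=1e-4, tol=1e-8, max_iter=50):
    history = [merit(x, mu)]
    for _ in range(max_iter):
        p, _lam = sqp_direction(x)
        if math.sqrt(sum(pi * pi for pi in p)) < tol:
            break
        decrease = model_decrease(x, p, mu)
        alpha, accepted = 1.0, False
        for _try in range(30):              # backtracking from the unit step
            trial = [xi + alpha * pi for xi, pi in zip(x, p)]
            if merit(trial, mu) <= merit(x, mu) - eta * alpha * decrease:
                accepted = True
                break
            alpha *= 0.5
        if not accepted:
            break
        x = trial
        history.append(merit(x, mu))
    return x, history


x_final, history = sqp_line_search([-1.5, -0.5])
```

The list `history` records the merit function value at each accepted iterate; by construction of the acceptance test it is nonincreasing.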

Theorem  18.3. 

Suppose that the SQP algorithm just described is applied to the nonlinear program (18.10). Suppose that the sequences $\{x_k\}$ and $\{x_k + p_k\}$ are contained in a closed, bounded, convex region of $\mathbb{R}^n$ in which $f$ and $c_i$ have continuous first derivatives. Suppose that the matrices $B_k$ and multipliers are bounded and that $\mu$ satisfies $\mu \ge \|\lambda_k\|_\infty + \rho$ for all $k$, where $\rho$ is a positive constant. Then all limit points of the sequence $\{x_k\}$ are KKT points of the nonlinear program (18.10).

The conclusions of the theorem are quite satisfactory, but the assumptions are somewhat
restrictive. For example, the condition that the sequence $\{x_k + p_k\}$ stays within a
bounded set rules out the case in which the Hessians $B_k$ or constraint Jacobians become
ill conditioned. Global convergence results that are established under more realistic conditions
are surveyed by Conn, Gould, and Toint [74]. An example of a result of this type is
Theorem  19.2.  Although  this  theorem  is  established  for  a  nonlinear  interior-point  method, 
similar  results  can  be  established  for  trust-region  SQP  methods. 


RATE  OF  CONVERGENCE 

We  now  derive  conditions  that  guarantee  the  local  convergence  of  SQP  methods,  as 
well  as  conditions  that  ensure  a  superlinear  rate  of  convergence.  For  simplicity,  we  limit  our 
discussion to Algorithm 18.1 for equality-constrained optimization, and consider both exact
Hessian and quasi-Newton versions. The results presented here can be applied to algorithms
for inequality-constrained problems once the active set has settled at its final optimal value
(see  Theorem  18.1). 

We  begin  by  listing  a  set  of  assumptions  on  the  problem  that  will  be  useful  in  this 
section. 

Assumptions  18.2. 

The point $x^*$ is a local solution of problem (18.1) at which the following conditions hold.

(a) The functions $f$ and $c$ are twice differentiable in a neighborhood of $x^*$ with Lipschitz continuous second derivatives.

(b) The linear independence constraint qualification (Definition 12.4) holds at $x^*$. This condition implies that the KKT conditions (12.34) are satisfied for some vector of multipliers $\lambda^*$.

(c) The second-order sufficient conditions (Theorem 12.6) hold at $(x^*, \lambda^*)$.

We consider first an SQP method that uses exact second derivatives.

Theorem 18.4.

Suppose that Assumptions 18.2 hold. Then, if $(x_0, \lambda_0)$ is sufficiently close to $(x^*, \lambda^*)$, the pairs $(x_k, \lambda_k)$ generated by Algorithm 18.1 converge quadratically to $(x^*, \lambda^*)$.

The proof follows directly from Theorem 11.2, since we know that Algorithm 18.1 is equivalent to Newton’s method applied to the nonlinear system $F(x, \lambda) = 0$, where $F$ is defined by (18.3).

We turn now to quasi-Newton variants of Algorithm 18.1, in which the Lagrangian Hessian $\nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k)$ is replaced by a quasi-Newton approximation $B_k$. We discussed in Section 18.3 algorithms that used approximations to the full Hessian, and also reduced-Hessian methods that maintained approximations to the projected Hessian $Z_k^T \nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k) Z_k$. As in the earlier discussion, we take $Z_k$ to be the $n \times (n - m)$ matrix whose columns span the null space of $A_k$, assuming in addition that the columns of $Z_k$ are orthonormal; see (15.22).

If we multiply the first block row of the KKT system (18.9) by $Z_k^T$, we obtain

$$ Z_k^T \nabla_{xx}^2 \mathcal{L}_k \, p_k = -Z_k^T \nabla f_k. \tag{18.62} $$

This equation, together with the second block row $A_k p_k = -c_k$ of (18.9), is sufficient to determine fully the value of $p_k$ when $x_k$ and $\lambda_k$ are not too far from their optimal values. In other words, only the projection of the Hessian $Z_k^T \nabla_{xx}^2 \mathcal{L}_k$ is significant; the remainder of $\nabla_{xx}^2 \mathcal{L}_k$ (its projection onto the range space of $A_k^T$) does not play a role in determining $p_k$.

By multiplying (18.62) by $Z_k$, and defining the following matrix $P_k$, which projects onto the null space of $A_k$:

$$ P_k = I - A_k^T \left[ A_k A_k^T \right]^{-1} A_k = Z_k Z_k^T, $$

we can rewrite (18.62) equivalently as follows:

$$ P_k \nabla_{xx}^2 \mathcal{L}_k \, p_k = -P_k \nabla f_k. $$
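
The two expressions for the null-space projector can be checked numerically. The Python sketch below does so for a single constraint gradient $a$ (so that $A_k = a^T$ and $A_k A_k^T$ is a scalar); the vector $a$ and the orthonormal basis $Z$ are hypothetical data chosen so the arithmetic is easy to follow.

```python
import math


def null_space_projector(a):
    """P = I - a (a^T a)^{-1} a^T for a single constraint row A = a^T."""
    n = len(a)
    aa = sum(ai * ai for ai in a)
    return [[(1.0 if i == j else 0.0) - a[i] * a[j] / aa for j in range(n)]
            for i in range(n)]


def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


a = [1.0, 2.0, 2.0]
P = null_space_projector(a)

# Orthonormal basis of the null space of a^T, stored as two column vectors.
z1 = [2.0 / math.sqrt(5.0), -1.0 / math.sqrt(5.0), 0.0]
z2 = [2.0 / math.sqrt(45.0), 4.0 / math.sqrt(45.0), -5.0 / math.sqrt(45.0)]
ZZt = [[z1[i] * z1[j] + z2[i] * z2[j] for j in range(3)] for i in range(3)]

Pa = matvec(P, a)                      # should vanish: a lies in range(A^T)
v = [3.0, 0.0, 0.0]
Pv = matvec(P, v)                      # component of v in the null space
PPv = matvec(P, Pv)                    # projecting twice changes nothing
max_gap = max(abs(P[i][j] - ZZt[i][j]) for i in range(3) for j in range(3))
```

Up to rounding error, `Pa` is zero, `PPv` equals `Pv`, and `max_gap` confirms that the formula based on $A_k$ agrees with $Z_k Z_k^T$.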


The discussion above, together with Theorem 18.4, suggests that a quasi-Newton method will be locally convergent if the quasi-Newton matrix $B_k$ is chosen so that $P_k B_k$ is a reasonable approximation of $P_k \nabla_{xx}^2 \mathcal{L}_k$, and that it will be superlinearly convergent if $P_k B_k$ approximates $P_k \nabla_{xx}^2 \mathcal{L}_k$ well. To make the second statement more precise, we present a result that can be viewed as an extension of the characterization of superlinear convergence (Theorem 3.6) to the equality-constrained case. In the following discussion, $\nabla_{xx}^2 \mathcal{L}_*$ denotes $\nabla_{xx}^2 \mathcal{L}(x^*, \lambda^*)$.

Theorem  18.5. 

Suppose that Assumptions 18.2 hold and that the iterates $x_k$ generated by Algorithm 18.1 with quasi-Newton approximate Hessians $B_k$ converge to $x^*$. Then $x_k$ converges superlinearly if and only if the Hessian approximation $B_k$ satisfies

$$ \lim_{k \to \infty} \frac{\| P_k (B_k - \nabla_{xx}^2 \mathcal{L}_*)(x_{k+1} - x_k) \|}{\| x_{k+1} - x_k \|} = 0. \tag{18.63} $$


We  can  apply  this  result  to  the  quasi-Newton  updating  schemes  discussed  earlier  in 
this  chapter,  beginning  with  the  full  BFGS  approximation  based  on  (18.13).  To  guarantee 
that  the  BFGS  approximation  is  always  well  defined,  we  make  the  (strong)  assumption  that 
the  Hessian  of  the  Lagrangian  is  positive  definite  at  the  solution. 

Theorem  18.6. 

Suppose that Assumptions 18.2 hold. Assume also that $\nabla_{xx}^2 \mathcal{L}_*$ and $B_0$ are symmetric and positive definite. If $\|x_0 - x^*\|$ and $\|B_0 - \nabla_{xx}^2 \mathcal{L}_*\|$ are sufficiently small, the iterates $x_k$ generated by Algorithm 18.1 with BFGS Hessian approximations $B_k$ defined by (18.13) and (18.16) (with $r_k = s_k$) satisfy the limit (18.63). Therefore, the iterates $x_k$ converge superlinearly to $x^*$.

For  the  damped  BFGS  updating  strategy  given  in  Procedure  18.2,  we  can  show  that  the 
rate  of  convergence  is  R-superlinear  (not  the  usual  Q-superlinear  rate;  see  the  Appendix). 

We now consider reduced-Hessian SQP methods that update an approximation $M_k$ to $Z_k^T \nabla_{xx}^2 \mathcal{L}_k Z_k$. From the definition of $P_k$, we see that $Z_k M_k Z_k^T$ can be considered as an approximation to the two-sided projection $P_k \nabla_{xx}^2 \mathcal{L}_k P_k$. Since reduced-Hessian methods do not approximate the one-sided projection $P_k \nabla_{xx}^2 \mathcal{L}_k$, we cannot expect (18.63) to hold. For these methods, we can state a condition for superlinear convergence by writing (18.63) as

$$ \lim_{k \to \infty} \left[ \frac{P_k (B_k - \nabla_{xx}^2 \mathcal{L}_*) P_k (x_{k+1} - x_k)}{\| x_{k+1} - x_k \|} + \frac{P_k (B_k - \nabla_{xx}^2 \mathcal{L}_*)(I - P_k)(x_{k+1} - x_k)}{\| x_{k+1} - x_k \|} \right] = 0, \tag{18.64} $$

and defining $B_k = Z_k M_k Z_k^T$. The following result shows that it is necessary only for the first term in (18.64) to go to zero to obtain a weaker form of superlinear convergence, namely, two-step superlinear convergence.

Theorem  18.7. 

Suppose that Assumption 18.2(a) holds and that the matrices $B_k$ are bounded. Assume also that the iterates $x_k$ generated by Algorithm 18.1 with approximate Hessians $B_k$ converge to $x^*$, and that

$$ \lim_{k \to \infty} \frac{\| P_k (B_k - \nabla_{xx}^2 \mathcal{L}_*) P_k (x_{k+1} - x_k) \|}{\| x_{k+1} - x_k \|} = 0. $$

Then the sequence $\{x_k\}$ converges to $x^*$ two-step superlinearly, that is,

$$ \lim_{k \to \infty} \frac{\| x_{k+2} - x^* \|}{\| x_k - x^* \|} = 0. \tag{18.65} $$
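
To see how two-step superlinear convergence differs from the ordinary one-step notion, consider a synthetic error sequence (not generated by any SQP method) in which every other iteration makes no progress and the remaining iterations square the error. The Python sketch below builds such a sequence and forms both ratios.

```python
def two_step_sequence(e0=0.5, n=12):
    """Synthetic errors: stall on even steps, square the error on odd steps."""
    errors = [e0]
    for k in range(n):
        prev = errors[-1]
        errors.append(prev if k % 2 == 0 else prev * prev)
    return errors


errors = two_step_sequence()
one_step = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
two_step = [errors[k + 2] / errors[k] for k in range(len(errors) - 2)]
```

The one-step ratios return to 1 on every stalled iteration, so the sequence is not superlinearly convergent in the usual sense, whereas each two-step ratio equals the current error and therefore tends to zero, as required by (18.65).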


In a reduced-Hessian method that uses BFGS updating, the iteration is $x_{k+1} = x_k + Y_k p_Y + Z_k p_Z$, where $p_Y$ and $p_Z$ are given by (18.19a), (18.23) (with $(Z_k^T \nabla_{xx}^2 \mathcal{L}_k Z_k)$ replaced by $M_k$). The reduced-Hessian approximation $M_k$ is updated by the BFGS formula using the correction vectors (18.26), and the initial approximation $M_0$ is symmetric and positive definite. If we make the assumption that the null space bases $Z_k$ used to define the correction vectors (18.26) vary smoothly, then we can apply Theorem 18.7 to show that $x_k$ converges two-step superlinearly.


18.8 PERSPECTIVES AND SOFTWARE


SQP  methods  are  most  efficient  if  the  number  of  active  constraints  is  nearly  as  large  as  the 
number  of  variables,  that  is,  if  the  number  of  free  variables  is  relatively  small.  They  require 
few  evaluations  of  the  functions,  in  comparison  with  augmented  Lagrangian  methods,  and 
can  be  more  robust  on  badly  scaled  problems  than  the  nonlinear  interior-point  methods 
described  in  the  next  chapter.  It  is  not  known  at  present  whether  the  IQP  or  EQP  approach 
will prove to be more effective for large problems. Current research focuses on widening
the  class  of  problems  that  can  be  solved  with  SQP  and  SLQP  approaches. 

Two established SQP software packages are snopt [128] and filtersqp [105]. The former code follows a line search approach, while the latter implements a trust-region strategy using a filter for step acceptance. The SLQP approach of Section 18.5 is implemented in knitro/active [49]. All three packages include mechanisms to ensure that the subproblems are always feasible and to guard against rank-deficient constraint Jacobians. snopt uses the penalty (or elastic) mode (18.12), which is invoked if the SQP subproblem is infeasible or if the Lagrange multiplier estimates become very large in norm. filtersqp includes a feasibility restoration phase that, in addition to promoting convergence, provides rapid identification of convergence to infeasible points. knitro/active implements a penalty method using the update strategy of Algorithm 18.5.

There is no established implementation of the S$\ell_1$QP approach, but prototype implementations have shown promise. The conopt [9] package implements a generalized reduced gradient method as well as an SQP method.

Quasi-Newton approximations to the Hessian of the Lagrangian are often used in practice. BFGS updating is generally less effective for constrained problems than in the unconstrained case because of the requirement of maintaining a positive definite approximation to an underlying matrix that often does not have this property. Nevertheless, the BFGS and limited-memory BFGS approximations implemented in snopt and knitro perform adequately in practice. knitro also offers an SR1 option that may be more effective than the BFGS option, but the question of how best to implement full quasi-Newton approximations for constrained optimization requires further investigation. The RSQP package [13] implements an SQP method that maintains a quasi-Newton approximation to the reduced Hessian.

The  Maratos  effect,  if  left  unattended,  can  significantly  slow  optimization  algorithms 
that  use  nonsmooth  merit  functions  or  filters.  However,  selective  application  of  second-order 
correction steps adequately resolves the difficulties in practice.

Trust-region implementations of the gradient projection method include tron [192] and Lancelot [72]. Both codes use a conjugate gradient iteration to perform the subspace minimization and apply an incomplete Cholesky preconditioner. Gradient projection methods in which the Hessian approximation is defined by limited-memory BFGS updating are implemented in lbfgs-b [322] and blmvm [17]. The properties of limited-memory BFGS matrices can be exploited to perform the projected gradient search and subspace minimization efficiently. spg [23] implements the gradient projection method using a nonmonotone line search.


NOTES  AND  REFERENCES 

SQP  methods  were  first  proposed  in  1963  by  Wilson  [306]  and  were  developed  in 
the  1970s  by  Garcia-Palomares  and  Mangasarian  [117],  Han  [163,  164],  and  Powell  [247, 
250,  249],  among  others.  Trust-region  variants  are  studied  by  Vardi  [295],  Celis,  Dennis, 
and  Tapia  [56],  and  Byrd,  Schnabel,  and  Shultz  [55].  See  Boggs  and  Tolle  [33]  and  Gould, 
Orban,  and  Toint  [147]  for  literature  surveys. 

The  SLQP  approach  was  proposed  by  Fletcher  and  Sainz  de  la  Maza  [108]  and  was 
further  developed  by  Chin  and  Fletcher  [59]  and  Byrd  et  al.  [49].  The  latter  paper  discusses 
how  to  update  the  LP  trust  region  and  many  other  details  of  implementation.  The  technique 
for  updating  the  penalty  parameter  implemented  in  Algorithm  18.5  is  discussed  in  [49,  47]. 
The S$\ell_1$QP method was proposed by Fletcher; see [101] for a complete discussion of this
method. 

Some analysis shows that several, but not all, of the good properties of BFGS updating are preserved by damped BFGS updating. Numerical experiments exposing the weakness of the approach are reported by Powell [254]. Second-order correction strategies were proposed by Coleman and Conn [65], Fletcher [100], Gabay [116], and Mayne and Polak [204]. The watchdog technique was proposed by Chamberlain et al. [57] and other nonmonotone strategies are described by Bonnans et al. [36]. For a comprehensive discussion of second-order correction and nonmonotone techniques, see the book by Conn, Gould, and Toint [74].

Two  filter  SQP  algorithms  are  described  by  Fletcher  and  Leyffer  [105]  and  Fletcher, 
Leyffer,  and  Toint  [106].  It  is  not  yet  known  whether  the  filter  strategy  has  advantages 
over merit functions. Both approaches are undergoing development and improved implementations
can be expected in the future. Theorem 18.3 is proved by Powell [252] and
Theorem  18.5  by  Boggs,  Tolle,  and  Wang  [34]. 


Exercises


18.1 Show that in the quadratic program (18.7) we can replace the linear term $\nabla f_k^T p$ by $\nabla_x \mathcal{L}(x_k, \lambda_k)^T p$ without changing the solution.

18.2 Prove Theorem 18.4.

18.3 Write a program that implements Algorithm 18.1. Use it to solve the problem

$$ \min \; e^{x_1 x_2 x_3 x_4 x_5} - \tfrac{1}{2} (x_1^3 + x_2^3 + 1)^2 \tag{18.66} $$

$$ \text{subject to} \quad x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 - 10 = 0, \tag{18.67} $$

$$ x_2 x_3 - 5 x_4 x_5 = 0, \tag{18.68} $$

$$ x_1^3 + x_2^3 + 1 = 0. \tag{18.69} $$

Use the starting point $x_0 = (-1.71, 1.59, 1.82, -0.763, -0.763)^T$. The solution is $x^* = (-1.8, 1.7, 1.9, -0.8, -0.8)^T$.

18.4 Show that the damped BFGS updating satisfies (18.17).

18.5 Consider the constraint $x_1^2 + x_2^2 = 1$. Write the linearized constraints (18.7b) at the following points: $(0, 0)^T$, $(0, 1)^T$, $(0.1, 0.02)^T$, $-(0.1, 0.02)^T$.

18.6 Prove Theorem 18.2 for the case in which the merit function is given by $\phi(x; \mu) = f(x) + \mu \|c(x)\|_q$, where $q > 0$. Use this lemma to show that the condition that ensures descent is given by $\mu > \|\lambda_{k+1}\|_r$, where $r > 0$ satisfies $r^{-1} + q^{-1} = 1$.

18.7 Write a program that implements the reduced-Hessian method given by (18.18), (18.19a), (18.21), (18.23). Use your program to solve the problem given in Exercise 18.3.

18.8 Show that the constraints (18.50b)-(18.50e) are always consistent.

18.9 Show that the feasibility problem (18.45a)-(18.45b) always has a solution $v_k$ lying in the range space of $A_k^T$. Hint: First show that if the trust-region constraint (18.45b) is active, $v_k$ lies in the range space of $A_k^T$. Next, show that if the trust region is inactive, the minimum-norm solution of (18.45a) lies in the range space of $A_k^T$.


Chapter 19


Interior-Point Methods for Nonlinear Programming


Interior-point (or barrier) methods have proved to be as successful for nonlinear optimization as for linear programming, and together with active-set SQP methods, they are currently considered the most powerful algorithms for large-scale nonlinear programming. Some of the key ideas, such as primal-dual steps, carry over directly from the linear programming case, but several important new challenges arise. These include the treatment of nonconvexity, the strategy for updating the barrier parameter in the presence of nonlinearities, and the need to ensure progress toward the solution. In this chapter we describe two classes of interior-point methods that have proved effective in practice.

The methods in the first class can be viewed as direct extensions of interior-point methods for linear and quadratic programming. They use line searches to enforce convergence and employ direct linear algebra (that is, matrix factorizations) to compute steps. The methods in the second class use a quadratic model to define the step and incorporate a trust-region constraint to provide stability. These two approaches, which coincide asymptotically, have similarities with line search and trust-region SQP methods.

Barrier  methods  for  nonlinear  optimization  were  developed  in  the  1960s  but  fell  out  of 
favor  for  almost  two  decades.  The  success  of  interior-point  methods  for  linear  programming 
stimulated renewed interest in them for the nonlinear case. By the late 1990s, a new generation
of methods and software for nonlinear programming had emerged. Numerical experience
indicates that interior-point methods are often faster than active-set SQP methods on
large  problems,  particularly  when  the  number  of  free  variables  is  large.  They  may  not  yet  be 
as  robust,  but  significant  advances  are  still  being  made  in  their  design  and  implementation. 
The  terms  “interior-point  methods”  and  “barrier  methods”  are  now  used  interchangeably. 

In  Chapters  14  and  16  we  discussed  interior-point  methods  for  linear  and  quadratic 
programming.  It  is  not  essential  that  the  reader  study  those  chapters  before  reading  this  one, 
although doing so will give a better perspective. The first part of this chapter assumes familiarity
primarily with the KKT conditions and Newton’s method, and the second part of the
chapter  relies  on  concepts  from  sequential  quadratic  programming  presented  in  Chapter  18. 

The problem under consideration in this chapter is written as follows:

$$ \min_{x, s} \; f(x) \tag{19.1a} $$

$$ \text{subject to} \quad c_E(x) = 0, \tag{19.1b} $$

$$ c_I(x) - s = 0, \tag{19.1c} $$

$$ s \ge 0. \tag{19.1d} $$

The vector $c_I(x)$ is formed from the scalar functions $c_i(x)$, $i \in \mathcal{I}$, and similarly for $c_E(x)$. Note that we have transformed the inequalities $c_I(x) \ge 0$ into equalities by the introduction of a vector $s$ of slack variables. We use $l$ to denote the number of equality constraints (that is, the dimension of the vector $c_E$) and $m$ to denote the number of inequality constraints (the dimension of $c_I$).


19.1  TWO  INTERPRETATIONS 


Interior-point  methods  can  be  seen  as  continuation  methods  or  as  barrier  methods.  We 
discuss  both  derivations,  starting  with  the  continuation  approach. 

The KKT conditions (12.1) for the nonlinear program (19.1) can be written as

\begin{align}
\nabla f(x) - A_E^T(x) y - A_I^T(x) z &= 0, \tag{19.2a} \\
Sz - \mu e &= 0, \tag{19.2b} \\
c_E(x) &= 0, \tag{19.2c} \\
c_I(x) - s &= 0, \tag{19.2d}
\end{align}

with $\mu = 0$, together with

\[ s \ge 0, \quad z \ge 0. \tag{19.3} \]

Here $A_E(x)$ and $A_I(x)$ are the Jacobian matrices of the functions $c_E$ and $c_I$, respectively, and
$y$ and $z$ are their Lagrange multipliers. We define $S$ and $Z$ to be the diagonal matrices whose
diagonal entries are given by the vectors $s$ and $z$, respectively, and let $e = (1, 1, \ldots, 1)^T$.

Equation (19.2b), with $\mu = 0$, and the bounds (19.3) introduce into the problem the
combinatorial aspect of determining the optimal active set, illustrated in Example 15.1. We
circumvent this difficulty by letting $\mu$ be strictly positive, thus forcing the variables $s$ and $z$ to
take positive values. The homotopy (or continuation) approach consists of (approximately)
solving the perturbed KKT conditions (19.2) for a sequence of positive parameters $\{\mu_k\}$ that
converges to zero, while maintaining $s, z > 0$. The hope is that, in the limit, we will obtain
a point that satisfies the KKT conditions for the nonlinear program (19.1). Furthermore,
by requiring the iterates to decrease a merit function (or to be acceptable to a filter), the
iteration is likely to converge to a minimizer, not simply a KKT point.

The homotopy approach is justified locally. In a neighborhood of a solution
$(x^*, s^*, y^*, z^*)$ that satisfies the linear independence constraint qualification (LICQ)
(Definition 12.4), the strict complementarity condition (Definition 12.5), and the second-order
sufficient conditions (Theorem 12.6), we have that for all sufficiently small positive
values of $\mu$, the system (19.2) has a locally unique solution, which we denote by
$(x(\mu), s(\mu), y(\mu), z(\mu))$. The trajectory described by these points is called the primal-dual
central path, and it converges to $(x^*, s^*, y^*, z^*)$ as $\mu \to 0$.
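As a small illustration of the central path, consider the one-dimensional problem of minimizing $(x+1)^2$ subject to $x \ge 0$. This example is chosen here for illustration and is not taken from the text. For it, the perturbed KKT conditions (19.2) reduce to $2(x+1) - z = 0$, $sz = \mu$, and $x - s = 0$, so the central path can be written in closed form. The sketch below (the function name `central_path_point` is hypothetical) evaluates that path for decreasing $\mu$ and shows it approaching the solution $(x^*, s^*, z^*) = (0, 0, 2)$.

```python
import math

def central_path_point(mu):
    """Solve the perturbed KKT system for min (x+1)^2 s.t. x >= 0.

    Stationarity gives z = 2(x+1), feasibility gives s = x, and the
    perturbed complementarity s*z = mu becomes 2x^2 + 2x - mu = 0.
    """
    x = (-1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0  # positive root
    s = x
    z = 2.0 * (x + 1.0)
    return x, s, z

for mu in (1.0, 1e-2, 1e-4, 1e-8):
    x, s, z = central_path_point(mu)
    print(f"mu={mu:.0e}  x={x:.3e}  s={s:.3e}  z={z:.6f}  s*z={s*z:.3e}")
```

Along this path both $s$ and $z$ stay strictly positive for every $\mu > 0$, which is the property the homotopy approach relies on.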

The second derivation of interior-point methods associates with (19.1) the barrier
problem

\begin{align}
\min_{x,s} \quad & f(x) - \mu \sum_{i=1}^{m} \log s_i \tag{19.4a} \\
\text{subject to} \quad & c_E(x) = 0, \tag{19.4b} \\
& c_I(x) - s = 0, \tag{19.4c}
\end{align}

where $\mu$ is a positive parameter and $\log(\cdot)$ denotes the natural logarithm function. One
need not include the inequality $s \ge 0$ in (19.4) because minimization of the barrier term
$-\mu \sum_{i=1}^{m} \log s_i$ in (19.4a) prevents the components of $s$ from becoming too close to zero.
(Recall that $(-\log t) \to \infty$ as $t \downarrow 0$.) Problem (19.4) also avoids the combinatorial aspect
of nonlinear programs, but its solution does not coincide with that of (19.1) for $\mu > 0$. The
barrier approach consists of finding (approximate) solutions of the barrier problem (19.4)
for a sequence of positive barrier parameters $\{\mu_k\}$ that converges to zero.

To compare the homotopy and barrier approaches, we write the KKT conditions for
(19.4) as follows:

\begin{align}
\nabla f(x) - A_E^T(x) y - A_I^T(x) z &= 0, \tag{19.5a} \\
-\mu S^{-1} e + z &= 0, \tag{19.5b} \\
c_E(x) &= 0, \tag{19.5c} \\
c_I(x) - s &= 0. \tag{19.5d}
\end{align}

Note that they differ from (19.2) only in the second equation, which becomes quite nonlinear
near the solution as $s \to 0$. It is advantageous for Newton's method to transform the rational
equation (19.5b) into a quadratic equation. We do so by multiplying this equation by $S$, a
procedure that does not change the solution of (19.5) because the diagonal elements of $S$
are positive. After this transformation, the KKT conditions for the barrier problem coincide
with the perturbed KKT system (19.2).

The term "interior point" derives from the fact that early barrier methods [98] did
not use slacks and assumed that the initial point $x_0$ is feasible with respect to the inequality
constraints $c_i(x) \ge 0$, $i \in \mathcal{I}$. These methods used the barrier function

\[ f(x) - \mu \sum_{i \in \mathcal{I}} \log c_i(x) \]

to prevent the iterates from leaving the feasible region defined by the inequalities. (We
discuss this barrier function further in Section 19.6.) Most modern interior-point methods
are infeasible (they can start from any initial point $x_0$) and remain interior only with respect
to the constraints $s \ge 0$, $z \ge 0$. However, they can be designed so that once they generate a
feasible iterate, all subsequent iterates remain feasible with respect to the inequalities.

In  the  next  sections  we  will  see  that  the  homotopy  and  barrier  interpretations  are  both 
useful.  The  homotopy  view  gives  rise  to  the  definition  of  the  primal-dual  direction,  whereas 
the  barrier  view  is  vital  in  the  design  of  globally  convergent  iterations. 


19.2  A  BASIC  INTERIOR-POINT  ALGORITHM

Applying Newton's method to the nonlinear system (19.2), in the variables $x, s, y, z$, we
obtain

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} & 0 & -A_E^T(x) & -A_I^T(x) \\
0 & Z & 0 & S \\
A_E(x) & 0 & 0 & 0 \\
A_I(x) & -I & 0 & 0
\end{bmatrix}
\begin{bmatrix} p_x \\ p_s \\ p_y \\ p_z \end{bmatrix}
= -
\begin{bmatrix}
\nabla f(x) - A_E^T(x) y - A_I^T(x) z \\
Sz - \mu e \\
c_E(x) \\
c_I(x) - s
\end{bmatrix},
\tag{19.6}
\]

where $\mathcal{L}$ denotes the Lagrangian for (19.1a)-(19.1c):

\[ \mathcal{L}(x, s, y, z) = f(x) - y^T c_E(x) - z^T (c_I(x) - s). \tag{19.7} \]

The system (19.6) is called the primal-dual system (in contrast with the primal system
discussed in Section 19.3). After the step $p = (p_x, p_s, p_y, p_z)$ has been determined, we
compute the new iterate $(x^+, s^+, y^+, z^+)$ as

\begin{align}
x^+ &= x + \alpha_s^{\max} p_x, & s^+ &= s + \alpha_s^{\max} p_s, \tag{19.8a} \\
y^+ &= y + \alpha_z^{\max} p_y, & z^+ &= z + \alpha_z^{\max} p_z, \tag{19.8b}
\end{align}

where

\begin{align}
\alpha_s^{\max} &= \max\{\alpha \in (0, 1] : s + \alpha p_s \ge (1 - \tau) s\}, \tag{19.9a} \\
\alpha_z^{\max} &= \max\{\alpha \in (0, 1] : z + \alpha p_z \ge (1 - \tau) z\}, \tag{19.9b}
\end{align}

with $\tau \in (0, 1)$. (A typical value of $\tau$ is 0.995.) The condition (19.9), called the fraction to
the boundary rule, prevents the variables $s$ and $z$ from approaching their lower bounds of 0
too quickly.
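The fraction to the boundary rule (19.9) has a simple closed form: only components with a negative step entry restrict the step length, and for those the bound is $\alpha \le -\tau v_i / p_i$. The following minimal sketch shows this computation. The function name `max_step_to_boundary` and the sample vectors are assumptions made for illustration.

```python
def max_step_to_boundary(v, p, tau=0.995):
    """Largest alpha in (0, 1] with v + alpha*p >= (1 - tau)*v, assuming v > 0."""
    alpha = 1.0
    for v_i, p_i in zip(v, p):
        if p_i < 0.0:  # only decreasing components can approach the bound
            alpha = min(alpha, -tau * v_i / p_i)
    return alpha

# Sample data: the first slack would become negative with a full step.
s, p_s = [1.0, 2.0], [-2.0, 1.0]
alpha_s_max = max_step_to_boundary(s, p_s)
print(alpha_s_max)
```

The same routine can be applied to the pair $(z, p_z)$ to obtain the dual step length in (19.9b).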

This simple iteration provides the basis of modern interior-point methods, though
various modifications are needed to cope with nonconvexities and nonlinearities. The other
major ingredient is the procedure for choosing the sequence of parameters $\{\mu_k\}$, which
from now on we will call the barrier parameters. In the approach studied by Fiacco and
McCormick [98], the barrier parameter $\mu$ is held fixed for a series of iterations until the
KKT conditions (19.2) are satisfied to some accuracy. An alternative approach is to update
the barrier parameter at each iteration. Both approaches have their merits and are discussed
in Section 19.3.

The primal-dual matrix in (19.6) remains nonsingular as the iteration converges to a
solution that satisfies the second-order sufficiency conditions and strict complementarity.
More specifically, if $x^*$ is a solution point for which strict complementarity holds, then
for every index $i$ either $s_i$ or $z_i$ remains bounded away from zero as the iterates approach
$x^*$, ensuring that the second block row of the primal-dual matrix (19.6) has full row rank.
Therefore, the interior-point approach does not, in itself, give rise to ill conditioning or
singularity. This fact allows us to establish a fast (superlinear) rate of convergence; see
Section 19.8.

We summarize the discussion by describing a concrete implementation of this basic
interior-point method. We use the following error function, which is based on the perturbed
KKT system (19.2):

\[
E(x, s, y, z; \mu) = \max \left\{ \|\nabla f(x) - A_E(x)^T y - A_I(x)^T z\|,\ \|Sz - \mu e\|,\ \|c_E(x)\|,\ \|c_I(x) - s\| \right\},
\tag{19.10}
\]

for some vector norm $\|\cdot\|$.

Algorithm 19.1 (Basic Interior-Point Algorithm).

Choose $x_0$ and $s_0 > 0$, and compute initial values for the multipliers $y_0$ and $z_0 > 0$.
Select an initial barrier parameter $\mu_0 > 0$ and parameters $\sigma, \tau \in (0, 1)$. Set $k \leftarrow 0$.

repeat until a stopping test for the nonlinear program (19.1) is satisfied
    repeat until $E(x_k, s_k, y_k, z_k; \mu_k) \le \mu_k$
        Solve (19.6) to obtain the search direction $p = (p_x, p_s, p_y, p_z)$;
        Compute $\alpha_s^{\max}$, $\alpha_z^{\max}$ using (19.9);
        Compute $(x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1})$ using (19.8);
        Set $\mu_{k+1} \leftarrow \mu_k$ and $k \leftarrow k + 1$;
    end
    Choose $\mu_k \in (0, \sigma \mu_k)$;
end


An algorithm that updates the barrier parameter $\mu_k$ at every iteration is easily obtained
from Algorithm 19.1 by removing the requirement that the KKT conditions be satisfied for
each $\mu_k$ (the inner "repeat" loop) and by using a dynamic rule for updating $\mu_k$ in the
penultimate line.
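To make the structure of Algorithm 19.1 concrete, the following sketch applies it to the toy problem of minimizing $(x+1)^2$ subject to $x \ge 0$, whose solution is $x^* = 0$ with multiplier $z^* = 2$. The problem, the starting point, the tolerance, the iteration cap, and all function names are assumptions made for illustration. There are no equality constraints, so the blocks involving $A_E$ and $y$ are dropped from (19.6), and the barrier parameter is reduced by the constant factor $\sigma$.

```python
import numpy as np

def kkt_error(x, s, z, mu):
    """Error function (19.10) for min (x+1)^2 s.t. x >= 0 (no equalities)."""
    return max(abs(2.0 * (x + 1.0) - z), abs(s * z - mu), abs(x - s))

def step_to_boundary(v, p, tau):
    """Fraction to the boundary rule (19.9) for a scalar v > 0."""
    return 1.0 if p >= 0.0 else min(1.0, -tau * v / p)

def basic_interior_point(x=1.0, s=1.0, z=1.0, mu=1.0, sigma=0.2,
                         tau=0.995, tol=1e-8, max_iter=200):
    it = 0
    # Outer loop: stop when the unperturbed KKT conditions hold to tol.
    while kkt_error(x, s, z, 0.0) > tol and it < max_iter:
        # Inner loop: Newton steps on (19.2) for the current mu.
        while kkt_error(x, s, z, mu) > mu and it < max_iter:
            K = np.array([[2.0, 0.0, -1.0],   # Hessian, 0, -A_I^T
                          [0.0, z, s],        # 0, Z, S
                          [1.0, -1.0, 0.0]])  # A_I, -I, 0
            rhs = -np.array([2.0 * (x + 1.0) - z, s * z - mu, x - s])
            p_x, p_s, p_z = np.linalg.solve(K, rhs)
            a_s = step_to_boundary(s, p_s, tau)
            a_z = step_to_boundary(z, p_z, tau)
            x, s, z = x + a_s * p_x, s + a_s * p_s, z + a_z * p_z
            it += 1
        mu *= sigma  # constant-factor decrease of the barrier parameter
    return x, s, z, mu, it

x, s, z, mu, it = basic_interior_point()
print(f"x={x:.3e}  s={s:.3e}  z={z:.6f}  Newton steps={it}")
```

This sketch omits the safeguards discussed in Section 19.3 (line search, inertia correction, merit function), so it should be read as an illustration of the loop structure rather than as a robust solver.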

The  following  theorem  provides  a  theoretical  foundation  for  interior-point  methods 
that  compute  only  approximate  solutions  of  the  barrier  problem. 

Theorem 19.1.

Suppose that Algorithm 19.1 generates an infinite sequence of iterates $\{x_k\}$ and that
$\{\mu_k\} \to 0$ (that is, that the algorithm does not loop infinitely in the inner "repeat" statement).
Suppose that $f$ and $c$ are continuously differentiable functions. Then all limit points $\hat{x}$ of $\{x_k\}$
are feasible. Furthermore, if any limit point $\hat{x}$ of $\{x_k\}$ satisfies the linear independence constraint
qualification (LICQ), then the first-order optimality conditions of the problem (19.1) hold at $\hat{x}$.


PROOF. For simplicity, we prove the result for the case in which the nonlinear program
(19.1) contains only inequality constraints, leaving the extension of the result as an exercise.
For ease of notation, we denote the inequality constraints $c_I$ by $c$. Let $\hat{x}$ be a limit point of the
sequence $\{x_k\}$, and let $\{x_{k_l}\}$ be a convergent subsequence, namely, $\{x_{k_l}\} \to \hat{x}$. Since $\mu_k \to 0$,
the error $E$ given by (19.10) converges to zero, so we have $(c_{k_l} - s_{k_l}) \to 0$. By continuity of
$c$, this fact implies that $\hat{c} = c(\hat{x}) \ge 0$ (that is, $\hat{x}$ is feasible) and $s_{k_l} \to \hat{s} = \hat{c}$.

Now suppose that the linear independence constraint qualification holds at $\hat{x}$, and
consider the set of active indices

\[ \mathcal{A} = \{ i : \hat{c}_i = 0 \}. \]

For $i \notin \mathcal{A}$, we have $\hat{c}_i > 0$ and $\hat{s}_i > 0$, and thus by the complementarity condition (19.2b),
we have that $[z_{k_l}]_i \to 0$. From this fact and $\nabla f_{k_l} - A_{k_l}^T z_{k_l} \to 0$, we deduce that

\[ \nabla f_{k_l} - \sum_{i \in \mathcal{A}} [z_{k_l}]_i \nabla c_i(x_{k_l}) \to 0. \tag{19.11} \]

By the constraint qualification hypothesis, the vectors $\{\nabla \hat{c}_i : i \in \mathcal{A}\}$ are linearly independent.
Hence, by (19.11) and continuity of $\nabla f(\cdot)$ and $\nabla c_i(\cdot)$, $i \in \mathcal{A}$, the positive sequence
$\{z_{k_l}\}$ converges to some value $\hat{z} \ge 0$. Taking the limit in (19.11), we have that

\[ \nabla f(\hat{x}) = \sum_{i \in \mathcal{A}} \hat{z}_i \nabla c_i(\hat{x}). \]

We also have that $\hat{c}^T \hat{z} = 0$, completing the proof. □

Practical interior-point algorithms fall into two categories. The first builds on
Algorithm 19.1, adding a line search and features to control the rate of decrease in the slacks $s$ and
multipliers $z$, and introducing modifications in the primal-dual system when negative curvature
is encountered. The second category of algorithms, presented in Section 19.5, computes
steps by minimizing a quadratic model of (19.4), subject to a trust-region constraint. The
two approaches share many features described in the next section.


19.3  ALGORITHMIC  DEVELOPMENT 


We  now  discuss  a  series  of  modifications  and  extensions  of  Algorithm  19.1  that  enable  it  to 
solve  nonconvex  nonlinear  problems,  starting  from  any  initial  estimate. 

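Often, the primal-dual system (19.6) is rewritten in the symmetric form

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} & 0 & A_E^T(x) & A_I^T(x) \\
0 & \Sigma & 0 & -I \\
A_E(x) & 0 & 0 & 0 \\
A_I(x) & -I & 0 & 0
\end{bmatrix}
\begin{bmatrix} p_x \\ p_s \\ -p_y \\ -p_z \end{bmatrix}
= -
\begin{bmatrix}
\nabla f(x) - A_E^T(x) y - A_I^T(x) z \\
z - \mu S^{-1} e \\
c_E(x) \\
c_I(x) - s
\end{bmatrix},
\tag{19.12}
\]

where

\[ \Sigma = S^{-1} Z. \tag{19.13} \]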

This formulation permits the use of a symmetric linear equations solver, which reduces the
computational work of each iteration.

PRIMAL  VS.  PRIMAL-DUAL  SYSTEM

If we apply Newton's method directly to the optimality conditions (19.5) of the barrier
problem (instead of first transforming (19.5b) into (19.2b)) and then symmetrize the iteration matrix,
we obtain the system (19.12) but with $\Sigma$ given by

\[ \Sigma = \mu S^{-2}. \tag{19.14} \]

This is often called the primal system, in contrast with the primal-dual system arising from
(19.13). (This nomenclature owes more to the historical development of interior-point
methods than to the concept of primal-dual iterations.) Whereas in the primal-dual choice
(19.13) the vector $z$ can be seen as a general multiplier estimate, the primal term (19.14) is
obtained by making the specific selection $Z = \mu S^{-1}$; we return to this choice of multipliers
in Section 19.6.

Even though the systems (19.2) and (19.5) are equivalent, Newton's method applied
to them will generally produce different iterates, and there are reasons for preferring the
primal-dual system. Note that (19.2b) has the advantage that its derivatives are bounded as
any slack variables approach zero; such is not the case with (19.5b). Moreover, analysis of the
primal step as well as computational experience has shown that, under some circumstances,
the primal system (19.12), (19.14) tends to produce poor steps that violate the bounds $s > 0$
and $z > 0$ significantly, resulting in slow progress; see Section 19.6.

SOLVING  THE  PRIMAL-DUAL  SYSTEM 

Apart  from  the  cost  of  evaluating  the  problem  functions  and  their  derivatives,  the 
work  of  the  interior-point  iteration  is  dominated  by  the  solution  of  the  primal-dual  system 
(19.12),  (19.13).  An  efficient  linear  solver,  using  either  sparse  factorization  or  iterative 
techniques,  is  therefore  essential  for  fast  solution  of  large  problems. 

The symmetric matrix in (19.12) has the familiar form of a KKT matrix (cf. (16.7),
(18.6)), and the linear system can be solved by the approaches described in Chapter 16. We
can first reduce the system by eliminating $p_s$ using the second equation in (19.6), giving

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} & A_E^T(x) & A_I^T(x) \\
A_E(x) & 0 & 0 \\
A_I(x) & 0 & -\Sigma^{-1}
\end{bmatrix}
\begin{bmatrix} p_x \\ -p_y \\ -p_z \end{bmatrix}
= -
\begin{bmatrix}
\nabla f(x) - A_E^T(x) y - A_I^T(x) z \\
c_E(x) \\
c_I(x) - \mu Z^{-1} e
\end{bmatrix}.
\tag{19.15}
\]

This system can be factored by using a symmetric indefinite factorization; see (16.12). If we
denote the coefficient matrix in (19.15) by $K$, this factorization computes $P^T K P = L B L^T$,
where $L$ is lower triangular and $B$ is block diagonal, with blocks of size $1 \times 1$ or $2 \times 2$. $P$
is a matrix of row and column permutations that seeks a compromise between the goals
of preserving sparsity and ensuring numerical stability; see (3.51) and the discussion that
follows.


The system (19.15) can be reduced further by eliminating $p_z$ using the last equation,
to obtain the condensed coefficient matrix

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} + A_I^T \Sigma A_I & A_E^T(x) \\
A_E(x) & 0
\end{bmatrix},
\tag{19.16}
\]

which is much smaller than (19.12) when the number of inequality constraints is large.
Although significant fill-in can arise from the term $A_I^T \Sigma A_I$, it is tolerable in many applications.
A particularly favorable case, in which $A_I^T \Sigma A_I$ is diagonal, arises when the inequality
constraints are simple bounds.
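The successive eliminations leading from (19.6) to the condensed matrix (19.16) can be checked numerically. The sketch below builds the full system (19.6) from random data, solves it, and compares the result with the step recovered from the condensed system. The right-hand side of the condensed system, $-(r_d + A_I^T \Sigma (c_I - \mu Z^{-1} e))$ with $r_d$ the first block of the right-hand side of (19.6), follows from substituting the last row of (19.15) into its first row. The data, dimensions, and variable names are assumptions made for illustration, and a positive definite matrix stands in for the Hessian of the Lagrangian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, m = 4, 1, 3
G = rng.standard_normal((n, n))
H = G @ G.T + n * np.eye(n)            # stand-in for the Hessian of the Lagrangian
A_E = rng.standard_normal((l, n))
A_I = rng.standard_normal((m, n))
s = rng.uniform(0.5, 2.0, m)
z = rng.uniform(0.5, 2.0, m)
mu = 0.1
r_d = rng.standard_normal(n)           # stand-in for grad f - A_E^T y - A_I^T z
c_E = rng.standard_normal(l)
c_I = rng.standard_normal(m)

# Full unsymmetric primal-dual system (19.6).
K_full = np.block([
    [H, np.zeros((n, m)), -A_E.T, -A_I.T],
    [np.zeros((m, n)), np.diag(z), np.zeros((m, l)), np.diag(s)],
    [A_E, np.zeros((l, m)), np.zeros((l, l)), np.zeros((l, m))],
    [A_I, -np.eye(m), np.zeros((m, l)), np.zeros((m, m))],
])
rhs_full = -np.concatenate([r_d, s * z - mu, c_E, c_I - s])
p = np.linalg.solve(K_full, rhs_full)
p_x, p_s, p_y, p_z = p[:n], p[n:n + m], p[n + m:n + m + l], p[n + m + l:]

# Condensed system with matrix (19.16); unknowns are (p_x, -p_y).
Sigma = np.diag(z / s)
K_cond = np.block([[H + A_I.T @ Sigma @ A_I, A_E.T],
                   [A_E, np.zeros((l, l))]])
rhs_cond = -np.concatenate([r_d + A_I.T @ (Sigma @ (c_I - mu / z)), c_E])
q = np.linalg.solve(K_cond, rhs_cond)
p_x_c, p_y_c = q[:n], -q[n:]
p_z_c = -Sigma @ (A_I @ p_x_c + c_I - mu / z)   # last row of (19.15)
p_s_c = A_I @ p_x_c + (c_I - s)                 # last row of (19.6)

print(np.max(np.abs(p_x - p_x_c)), np.max(np.abs(p_z - p_z_c)))
```

The condensed system has only $n + l$ unknowns, which is the saving the text refers to when $m$ is large.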

The primal-dual system in any of the symmetric forms (19.12), (19.15), (19.16)
is ill conditioned because, by (19.13), some of the elements of $\Sigma$ diverge to $\infty$, while
others converge to zero as $\mu \to 0$. Nevertheless, because of the special form in which this
ill conditioning arises, the direction computed by a stable direct factorization method is
usually accurate. Damaging errors result only when the slacks $s$ or multipliers $z$ become
very close to zero (or when the Hessian $\nabla_{xx}^2 \mathcal{L}$ or the Jacobian matrix $A_E$ is almost rank
deficient). For this reason, direct factorization techniques are considered the most reliable
techniques for computing steps in interior-point methods.

Iterative linear algebra techniques can also be used for the step computation. Ill conditioning
is a grave concern in this context, and preconditioners that cluster the eigenvalues
of $\Sigma$ must be used. Fortunately, such preconditioners are easy to construct. For example,
let us introduce the change of variables $\tilde{p}_s = S^{-1} p_s$ in the system (19.12), and multiply
the second equation in (19.12) by $S$, transforming the term $\Sigma$ into $S \Sigma S$. As $\mu \to 0$ (and
assuming that $SZ \approx \mu I$) we have from (19.13) that all the elements of $S \Sigma S$ cluster around
$\mu I$. Other scalings can be used as well. The change of variables $\tilde{p}_s = \Sigma^{1/2} p_s$ provides the
perfect preconditioner, while $\tilde{p}_s = \sqrt{\mu}\, S^{-1} p_s$ transforms $\Sigma$ to $S \Sigma S / \mu$, which converges to
$I$ as $\mu \to 0$.
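The effect of this scaling is easy to see numerically. In the sketch below, the slack values are made up for illustration and the multipliers are placed exactly on the central path, so that $s_i z_i = \mu$. The diagonal of $\Sigma = S^{-1} Z$ then spans many orders of magnitude, whereas every diagonal entry of $S \Sigma S$ equals $\mu$.

```python
mu = 1e-8
s = [1e-8, 1e-4, 1.0]           # some slacks near zero (active constraints)
z = [mu / s_i for s_i in s]     # on the central path, s_i * z_i = mu

sigma = [z_i / s_i for s_i, z_i in zip(s, z)]               # Sigma = S^{-1} Z
scaled = [s_i * sig * s_i for s_i, sig in zip(s, sigma)]    # S Sigma S

print("spread of Sigma:    ", max(sigma) / min(sigma))
print("spread of S Sigma S:", max(scaled) / min(scaled))
```

Away from the central path the products $s_i z_i$ differ from $\mu$, so in practice the clustering is only approximate.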

We can apply an iterative method to one of the symmetric indefinite systems
(19.12), (19.15), or (19.16). The conjugate gradient method is not appropriate (except
as explained below) because it is designed for positive definite systems, but we can use
GMRES, QMR, or LSQR (see [136]). In addition to employing preconditioning that removes
the ill conditioning caused by the barrier approach, as discussed above, we need
to deal with possible ill conditioning caused by the Hessian $\nabla_{xx}^2 \mathcal{L}$ or the Jacobian matrices
$A_E$ and $A_I$. General-purpose preconditioners are difficult to find in this context, and
the success of an iterative method hinges on the use of problem-specific or structured
preconditioners.

An effective alternative is to use a null-space approach to solve the primal-dual system
and apply the CG method in the (positive definite) reduced space. As explained in
Section 16.3, we can do this by applying the projected CG iteration of Algorithm 16.2 using a
so-called constraint preconditioner. In the context of the system (19.12) the preconditioner
has the form

\[
\begin{bmatrix}
G & 0 & A_E^T(x) & A_I^T(x) \\
0 & T & 0 & -I \\
A_E(x) & 0 & 0 & 0 \\
A_I(x) & -I & 0 & 0
\end{bmatrix},
\tag{19.17}
\]

where $G$ is a sparse matrix that is positive definite on the null space of the constraints and $T$
is a diagonal matrix that equals or approximates $\Sigma$. This preconditioner keeps the Jacobian
information of $A_E$ and $A_I$ intact and thereby removes any ill conditioning present in these
matrices.


UPDATING  THE  BARRIER  PARAMETER 

The sequence of barrier parameters $\{\mu_k\}$ must converge to zero so that, in the limit,
we recover the solution of the nonlinear programming problem (19.1). If $\mu_k$ is decreased
too slowly, a large number of iterations will be required for convergence; but if it is decreased
too quickly, some of the slacks $s$ or multipliers $z$ may approach zero prematurely, slowing
progress of the iteration. We now describe several techniques for updating $\mu_k$ that have
proved to be effective in practice.

The strategy implemented in Algorithm 19.1, which we call the Fiacco-McCormick
approach, fixes the barrier parameter until the perturbed KKT conditions (19.2) are satisfied
to some accuracy. Then the barrier parameter is decreased by the rule

\[ \mu_{k+1} = \sigma_k \mu_k, \quad \text{with } \sigma_k \in (0, 1). \tag{19.18} \]

Some early implementations of interior-point methods chose $\sigma_k$ to be a constant (for example,
$\sigma_k = 0.2$). It is, however, preferable to let $\sigma_k$ take on two or more values (for example, 0.2
and 0.1), choosing smaller values when the most recent iterations make significant progress
toward the solution. Furthermore, by letting $\sigma_k \to 0$ near the solution, and letting the
parameter $\tau$ in (19.9) converge to 1, a superlinear rate of convergence can be obtained.

The  Fiacco-McCormick  approach  works  well  on  many  problems,  but  it  can  be  sensitive 
to  the  choice  of  the  initial  point,  the  initial  barrier  parameter  value,  and  the  scaling  of  the 
problem. 

Adaptive strategies for updating the barrier parameter are more robust in difficult
situations. These strategies, unlike the Fiacco-McCormick approach, vary $\mu$ at every iteration
depending on the progress of the algorithm. Most such strategies are based on
complementarity, as in the linear programming case (see Framework 14.1), and have the
form

\[ \mu_{k+1} = \sigma_k \frac{s_k^T z_k}{m}, \tag{19.19} \]

which allows $\mu_k$ to reflect the scale of the problem. One choice of $\sigma_k$, implemented in the
loqo package [294], is based on the deviation of the smallest complementarity product
$[s_k]_i [z_k]_i$ from the average:

\[
\sigma_k = 0.1 \min\left( 0.05\, \frac{1 - \xi_k}{\xi_k},\ 2 \right)^3,
\quad \text{where} \quad
\xi_k = \frac{\min_i\, [s_k]_i [z_k]_i}{(s_k)^T z_k / m}.
\tag{19.20}
\]

Here $[s_k]_i$ denotes the $i$th component of the iterate $s_k$, and similarly for $[z_k]_i$. When $\xi_k \approx 1$
(all the individual products are near to their average), the barrier parameter is decreased
aggressively.
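The adaptive rule (19.19) with the centrality-based choice (19.20) takes only a few lines. The sketch below is an illustration under the formulas as stated above; the function names `loqo_sigma` and `next_mu` and the sample vectors are hypothetical.

```python
def loqo_sigma(s, z):
    """Centering parameter sigma_k of (19.20) from slacks s and multipliers z."""
    m = len(s)
    products = [s_i * z_i for s_i, z_i in zip(s, z)]
    xi = min(products) / (sum(products) / m)   # smallest product over average
    return 0.1 * min(0.05 * (1.0 - xi) / xi, 2.0) ** 3

def next_mu(s, z):
    """Barrier parameter update (19.19)."""
    m = len(s)
    return loqo_sigma(s, z) * sum(s_i * z_i for s_i, z_i in zip(s, z)) / m

# Well-centered iterate: all products equal, so xi = 1 and sigma = 0.
print(loqo_sigma([1.0, 1.0], [1.0, 1.0]))
# Poorly centered iterate: one product is far below the average.
print(loqo_sigma([1.0, 1.0], [1.0, 0.01]))
```

For the poorly centered iterate the cap of 2 inside the minimum is active, so $\sigma_k = 0.1 \cdot 2^3 = 0.8$ and the barrier parameter is reduced only mildly.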

Predictor or probing strategies (see Section 14.2) can also be used to determine the
parameter $\sigma_k$ in (19.19). We calculate a predictor (affine scaling) direction

\[ (\Delta x^{\text{aff}}, \Delta s^{\text{aff}}, \Delta y^{\text{aff}}, \Delta z^{\text{aff}}) \]

by setting $\mu = 0$ in (19.12). We probe this direction by finding $\alpha_s^{\text{aff}}$ and $\alpha_z^{\text{aff}}$ to be the
longest step lengths that can be taken along the affine scaling direction before violating
the nonnegativity conditions $(s, z) \ge 0$. Explicit formulas for these step lengths are given
by (19.9) with $\tau = 1$. We then define $\mu_{\text{aff}}$ to be the value of complementarity along the
(shortened) affine scaling step, that is,

\[ \mu_{\text{aff}} = (s_k + \alpha_s^{\text{aff}} \Delta s^{\text{aff}})^T (z_k + \alpha_z^{\text{aff}} \Delta z^{\text{aff}}) / m, \tag{19.21} \]

and define $\sigma_k$ as follows:

\[ \sigma_k = \left( \frac{\mu_{\text{aff}}}{s_k^T z_k / m} \right)^3. \tag{19.22} \]

This heuristic choice of $\sigma_k$ was proposed for linear programming problems (see (14.34))
and also works well for nonlinear programs.
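The probing computation (19.21), (19.22) can be sketched as follows, assuming that the affine scaling components $\Delta s^{\text{aff}}$ and $\Delta z^{\text{aff}}$ have already been obtained from (19.12) with $\mu = 0$. The function names and the sample data are assumptions made for illustration.

```python
def longest_step(v, dv):
    """Largest alpha in (0, 1] keeping v + alpha*dv >= 0: rule (19.9) with tau = 1."""
    alpha = 1.0
    for v_i, d_i in zip(v, dv):
        if d_i < 0.0:
            alpha = min(alpha, -v_i / d_i)
    return alpha

def predictor_sigma(s, z, ds_aff, dz_aff):
    """Centering parameter (19.22) from the affine scaling step."""
    m = len(s)
    a_s = longest_step(s, ds_aff)
    a_z = longest_step(z, dz_aff)
    mu_aff = sum((s_i + a_s * ds_i) * (z_i + a_z * dz_i)
                 for s_i, z_i, ds_i, dz_i in zip(s, z, ds_aff, dz_aff)) / m
    mu_cur = sum(s_i * z_i for s_i, z_i in zip(s, z)) / m
    return (mu_aff / mu_cur) ** 3

# The affine step reduces complementarity from 1 to 0.1875 in this example.
print(predictor_sigma([1.0, 1.0], [1.0, 1.0], [-2.0, -0.5], [-0.5, -0.5]))
```

When the affine scaling step makes good progress, $\mu_{\text{aff}}$ is much smaller than the current complementarity and the cube makes $\sigma_k$ very small, so the barrier parameter is reduced sharply.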


HANDLING  NONCONVEXITY  AND  SINGULARITY 

The  direction  defined  by  the  primal-dual  system  (19.12)  is  not  always  productive 
because  it  seeks  to  locate  only  KKT  points;  it  can  move  toward  a  maximizer  or  other 
stationary  points.  In  Chapter  18  we  have  seen  that  the  Newton  step  (18.9)  for  the  equality- 
constrained  problem  (18.1)  can  be  guaranteed  to  be  a  descent  direction  for  a  large  class  of 
merit  functions — and  to  be  a  productive  direction  for  a  filter — if  the  Hessian  W  is  positive 
definite  on  the  tangent  space  of  the  constraints.  The  reason  is  that,  in  this  case,  the  step 
can  be  interpreted  as  the  minimization  of  a  convex  model  in  the  reduced  space  obtained  by 
eliminating the linearized constraints.

For the primal-dual system (19.12), the step $p$ is a descent direction if the matrix

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} & 0 \\
0 & \Sigma
\end{bmatrix}
\tag{19.23}
\]

is positive definite on the null space of the constraint matrix

\[
\begin{bmatrix}
A_E(x) & 0 \\
A_I(x) & -I
\end{bmatrix}.
\]

Lemma 16.3 states that this positive definiteness condition holds if the inertia of the
primal-dual matrix in (19.12) is given by

\[ (n + m,\ l + m,\ 0), \tag{19.24} \]

in other words, if this matrix has exactly $n + m$ positive, $l + m$ negative, and no zero
eigenvalues. (Recall that $l$ and $m$ denote the number of equality and inequality constraints,
respectively.) As discussed in Section 3.4, the inertia can be obtained from the symmetric
indefinite factorization of (19.12).

If the primal-dual matrix does not have the desired inertia, we can modify it as
follows. Note that the diagonal matrix $\Sigma$ is positive definite by construction but $\nabla_{xx}^2 \mathcal{L}$ can
be indefinite. Therefore, we can replace the latter matrix by $\nabla_{xx}^2 \mathcal{L} + \delta I$, where $\delta > 0$ is
sufficiently large to ensure that the inertia is given by (19.24). The size of this modification
is not known beforehand, but we can try successively larger values of $\delta$ until the desired
inertia is obtained.
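The trial-and-error choice of $\delta$ can be sketched as follows for a small problem without equality constraints ($l = 0$). Two simplifications are assumed here for brevity: the inertia is read off from the eigenvalues of the matrix rather than from a symmetric indefinite factorization, and $\delta$ is simply increased by a fixed factor starting from an arbitrary initial value. The function names, the factor, and the data are hypothetical.

```python
import numpy as np

def inertia(K, tol=1e-10):
    """(positive, negative, zero) eigenvalue counts of a symmetric matrix."""
    w = np.linalg.eigvalsh(K)
    return (int(np.sum(w > tol)), int(np.sum(w < -tol)),
            int(np.sum(np.abs(w) <= tol)))

def primal_dual_matrix(H, Sigma, A_I, delta):
    """Matrix of (19.12) without equality constraints, Hessian shifted by delta*I."""
    n, m = H.shape[0], Sigma.shape[0]
    return np.block([
        [H + delta * np.eye(n), np.zeros((n, m)), A_I.T],
        [np.zeros((m, n)), Sigma, -np.eye(m)],
        [A_I, -np.eye(m), np.zeros((m, m))],
    ])

def correct_inertia(H, Sigma, A_I, delta0=1e-4, factor=10.0, max_tries=20):
    """Return the first delta in the trial sequence giving inertia (n+m, m, 0)."""
    n, m = H.shape[0], Sigma.shape[0]
    target = (n + m, m, 0)          # (19.24) with l = 0
    delta = 0.0
    for _ in range(max_tries):
        if inertia(primal_dual_matrix(H, Sigma, A_I, delta)) == target:
            return delta
        delta = delta0 if delta == 0.0 else factor * delta
    raise RuntimeError("inertia could not be corrected")

H = np.diag([1.0, -0.5])            # indefinite Hessian
Sigma = np.array([[1.0]])
A_I = np.array([[1.0, 0.0]])
delta = correct_inertia(H, Sigma, A_I)
print(delta, inertia(primal_dual_matrix(H, Sigma, A_I, delta)))
```

In this example the negative curvature lies in a direction that the constraint does not control, so the shift must exceed 0.5 before the inertia becomes $(3, 1, 0)$.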

We must also guard against singularity of the primal-dual matrix caused by the rank
deficiency of $A_E$ (the matrix $[A_I \ \ {-I}]$ always has full rank). We do so by including a
regularization parameter $\gamma \ge 0$, in addition to the modification term $\delta I$, and work with
the modified primal-dual matrix

\[
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L} + \delta I & 0 & A_E(x)^T & A_I(x)^T \\
0 & \Sigma & 0 & -I \\
A_E(x) & 0 & -\gamma I & 0 \\
A_I(x) & -I & 0 & 0
\end{bmatrix}.
\tag{19.25}
\]

A procedure for selecting $\gamma$ and $\delta$ is given in Algorithm B.1 in Appendix B. It is invoked at
every iteration of the interior-point method to enforce the inertia condition (19.24) and to
guarantee nonsingularity. Other matrix modifications to ensure positive definiteness have
been discussed in Chapter 3 in the context of unconstrained minimization.


STEP  ACCEPTANCE:  MERIT  FUNCTIONS  AND  FILTERS

The role of the merit function or filter is to determine whether a step is productive
and should be accepted. Since interior-point methods can be seen as methods for solving
the barrier problem (19.4), it is appropriate to define the merit function $\phi$ or filter
in terms of barrier functions. We may use, for example, an exact merit function of the
form

\[
\phi_\nu(x, s) = f(x) - \mu \sum_{i=1}^{m} \log s_i + \nu \|c_E(x)\| + \nu \|c_I(x) - s\|,
\tag{19.26}
\]

where the norm is chosen, say, to be the $\ell_1$ or the $\ell_2$ norm (unsquared). The penalty
parameter $\nu > 0$ can be updated by using the strategies described in Chapter 18.
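A direct evaluation of the merit function (19.26) with the $\ell_1$ norm might look as follows. The argument names are hypothetical, and returning $+\infty$ for nonpositive slacks is a convention assumed here so that a backtracking search rejects such trial points automatically.

```python
import math

def merit(f_val, s, c_E, c_I, mu, nu):
    """phi_nu(x, s) of (19.26) with the l1 norm, given function values at x.

    f_val is f(x); c_E and c_I hold the constraint values c_E(x) and c_I(x).
    """
    if any(s_i <= 0.0 for s_i in s):
        return math.inf                      # outside the domain of the barrier
    barrier = -mu * sum(math.log(s_i) for s_i in s)
    infeas = (sum(abs(v) for v in c_E)
              + sum(abs(c - s_i) for c, s_i in zip(c_I, s)))
    return f_val + barrier + nu * infeas

# Sample values: f = 1, one equality violated by 0.5, one slack off by 0.5.
val = merit(1.0, [1.0, math.e], [0.5], [1.0, math.e - 0.5], mu=0.1, nu=2.0)
print(val)
```

For these sample values the barrier term contributes $-0.1$ and the two constraint violations contribute $2 \cdot (0.5 + 0.5)$, giving a merit value of about 2.9.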

In a line search method, after the step $p$ has been computed and the maximum step
lengths (19.9) have been determined, we perform a backtracking line search that computes
the step lengths

\[ \alpha_s \in (0, \alpha_s^{\max}], \quad \alpha_z \in (0, \alpha_z^{\max}], \tag{19.27} \]

providing sufficient decrease of the merit function or ensuring acceptability by the filter.
The new iterate is then defined as

\begin{align}
x^+ &= x + \alpha_s p_x, & s^+ &= s + \alpha_s p_s, \tag{19.28a} \\
y^+ &= y + \alpha_z p_y, & z^+ &= z + \alpha_z p_z. \tag{19.28b}
\end{align}

When defining a filter (see Section 15.4) the pairs of the filter are formed, on the one
hand, by the values of the barrier function $f(x) - \mu \sum_{i=1}^{m} \log s_i$ and, on the other hand, by
the constraint violations $\|(c_E(x), c_I(x) - s)\|$. A step will be accepted if it is not dominated by
any element in the filter. Under certain circumstances, if the step is not accepted by the filter,
instead of reducing the step length $\alpha_s$ in (19.8a), a feasibility restoration phase is invoked;
see the Notes and References at the end of the chapter.


QUASI-NEWTON  APPROXIMATIONS 

A quasi-Newton version of the primal-dual step is obtained by replacing $\nabla_{xx}^2 \mathcal{L}$ in
(19.12) by a quasi-Newton approximation $B$. We can use the BFGS (6.19) or SR1 (6.24)
update formulas described in Chapter 6 to define $B$, or we can follow a limited-memory BFGS
approach (see Chapter 7). It is important to approximate the Hessian of the Lagrangian of the
nonlinear program, not the Hessian of the barrier function, which is highly ill conditioned
and changes rapidly.

The correction pairs used by the quasi-Newton updating formula are denoted here by
$(\Delta x, \Delta l)$, replacing the notation $(s, y)$ of Chapter 6. After computing a step from $(x, s, y, z)$
to $(x^+, s^+, y^+, z^+)$, we define

\begin{align*}
\Delta l &= \nabla_x \mathcal{L}(x^+, s^+, y^+, z^+) - \nabla_x \mathcal{L}(x, s^+, y^+, z^+), \\
\Delta x &= x^+ - x.
\end{align*}


To ensure that the BFGS method generates a positive definite matrix, one can skip
or damp the update; see (18.14) and (18.15). SR1 updating must be safeguarded to avoid
unboundedness, as discussed in Section 6.2, and may also need to be modified so that the
inertia of the primal-dual matrix is given by (19.24). This modification can be performed
by means of Algorithm B.1.
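As an illustration of damping, the sketch below applies Powell's damped BFGS update to the pair $(\Delta x, \Delta l)$. It is assumed here that this standard rule, with the usual constants 0.2 and 0.8, is what the damping references above describe; the function name `damped_bfgs_update` is hypothetical. The example uses a pair with negative curvature, $\Delta x^T \Delta l < 0$, for which the undamped BFGS update would not preserve positive definiteness.

```python
import numpy as np

def damped_bfgs_update(B, dx, dl):
    """Powell-damped BFGS update of a positive definite matrix B."""
    Bdx = B @ dx
    dxBdx = dx @ Bdx
    dxdl = dx @ dl
    if dxdl >= 0.2 * dxBdx:
        theta = 1.0                                # ordinary BFGS update
    else:
        theta = 0.8 * dxBdx / (dxBdx - dxdl)       # damp toward B*dx
    r = theta * dl + (1.0 - theta) * Bdx           # modified gradient change
    return B - np.outer(Bdx, Bdx) / dxBdx + np.outer(r, r) / (dx @ r)

B = np.eye(2)
dx = np.array([1.0, 0.0])
dl = np.array([-1.0, 0.0])                         # negative curvature pair
B_new = damped_bfgs_update(B, dx, dl)
print(B_new, np.linalg.eigvalsh(B_new))
```

With these data the damping factor is $\theta = 0.4$, and the updated matrix remains positive definite.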

The quasi-Newton matrices $B$ generated in this manner are dense $n \times n$ matrices. For
large problems, limited-memory updating is desirable. One option is to implement a
limited-memory BFGS method by using the compact representations described in Section 7.2. Here
$B$ has the form

\[ B = \xi I + W M W^T, \tag{19.29} \]

where $\xi > 0$ is a scaling factor, $W$ is an $n \times 2\hat{m}$ matrix, $M$ is a $2\hat{m} \times 2\hat{m}$ symmetric
and nonsingular matrix, and $\hat{m}$ denotes the number of correction pairs saved in the
limited-memory updating procedure. The matrices $W$ and $M$ are formed by using the vectors
$\{\Delta l^k\}$ and $\{\Delta x^k\}$ accumulated in the last $\hat{m}$ iterations. Since the limited-memory matrix
$B$ is positive definite, and assuming $A_E$ has full rank, the primal-dual matrix is nonsingular,
and we can compute the solution to (19.12) by inverting the coefficient matrix using the
Sherman-Morrison-Woodbury formula (see Exercise 19.14).
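The idea behind this use of the Sherman-Morrison-Woodbury formula can be illustrated on the matrix $B$ alone, which is a simplification of the full primal-dual system. For $B = \xi I + W M W^T$ the formula gives $B^{-1} = \xi^{-1} I - \xi^{-2} W (M^{-1} + \xi^{-1} W^T W)^{-1} W^T$, so only a small $2\hat{m} \times 2\hat{m}$ system has to be solved. The data below are random placeholders, and a positive definite $M$ is assumed so that all matrices involved are safely nonsingular.

```python
import numpy as np

def solve_compact(xi, W, M, b):
    """Solve (xi*I + W M W^T) x = b via Sherman-Morrison-Woodbury."""
    small = np.linalg.inv(M) + (W.T @ W) / xi       # small 2m_hat x 2m_hat matrix
    return b / xi - W @ np.linalg.solve(small, W.T @ b) / xi**2

rng = np.random.default_rng(1)
n, k = 6, 4                                         # k plays the role of 2*m_hat
W = rng.standard_normal((n, k))
M = np.diag([1.0, 2.0, 0.5, 3.0])                   # placeholder, positive definite
xi = 0.7
b = rng.standard_normal(n)

x = solve_compact(xi, W, M, b)
B = xi * np.eye(n) + W @ M @ W.T                    # formed only for checking
print(np.max(np.abs(B @ x - b)))
```

The dense matrix $B$ is formed here only to verify the result; in a large-scale setting it would never be assembled.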

FEASIBLE  INTERIOR-POINT  METHODS 

In  many  applications,  it  is  desirable  for  all  of  the  iterates  generated  by  an  optimization 
algorithm  to  be  feasible  with  respect  to  some  or  all  of  the  inequality  constraints.  For  example, 
the objective function may be defined only when some of the constraints are satisfied, making
this  feature  essential. 

Interior-point methods provide a natural framework for deriving feasible algorithms.
If the current iterate $x$ satisfies $c_I(x) > 0$, then it is easy to adapt the primal-dual
iteration (19.12) so that feasibility is preserved. After computing the step $p$, we let
$x^+ = x + p_x$, redefine the slacks as

$$s^+ \leftarrow c_I(x^+), \tag{19.30}$$

and test whether the point $(x^+, s^+)$ is acceptable for the merit function $\phi_\nu$. If so,
we define this point to be the new iterate; otherwise we reject the step $p$ and compute a new,
shorter trial step. In a line search algorithm we backtrack, and in a trust-region method we
compute a new step with a reduced trust-region bound. This strategy is justified by the fact
that if at a trial point we have $c_i(x^+) \le 0$ for some inequality constraint, the value of
the merit function is $+\infty$, and we reject the trial point. We will also reject steps
$x + p_x$ that are too close to the boundary of the feasible region because such steps increase
the barrier term $-\mu \sum_{i \in I} \log(s_i)$ in the merit function (19.26).

Making the substitution (19.30) has the effect of replacing $\log(s_i)$ with $\log(c_i(x))$ in
the merit function, a technique reminiscent of the classical primal log-barrier approach
discussed in Section 19.6.
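The slack reset (19.30) combined with backtracking can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of the text: the acceptance test is simplified to strict feasibility plus a plain decrease in a user-supplied merit function, and the names `feasible_backtrack`, `c_ineq`, and `merit` are hypothetical.

```python
import math


def feasible_backtrack(x, p_x, c_ineq, merit, alpha=1.0, shrink=0.5, max_tries=50):
    """Shorten the step until x + alpha*p_x is strictly feasible and the merit value drops.

    Returns (x_new, s_new, alpha), or None if no acceptable step is found.
    """
    phi0 = merit(x, c_ineq(x))
    for _ in range(max_tries):
        x_trial = [xi + alpha * pi for xi, pi in zip(x, p_x)]
        s_trial = c_ineq(x_trial)  # slack reset: s+ = c_I(x+)
        # An infeasible trial point corresponds to a merit value of +infinity.
        if min(s_trial) > 0.0 and merit(x_trial, s_trial) < phi0:
            return x_trial, s_trial, alpha
        alpha *= shrink
    return None


# Toy problem (assumed): minimize x subject to x >= 0, with barrier parameter mu = 0.1.
mu = 0.1
c_ineq = lambda x: [x[0]]
merit = lambda x, s: x[0] - mu * sum(math.log(si) for si in s)

x_new, s_new, alpha = feasible_backtrack([1.0], [-3.0], c_ineq, merit)
```

In this toy run the full step and the half step both leave the feasible region, so the first accepted step length is 0.25.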


19.4  A  LINE  SEARCH  INTERIOR-POINT  METHOD 


We now give a more detailed description of a line search interior-point method. We denote
by $D\phi_\nu(x, s; p)$ the directional derivative of the merit function $\phi_\nu$ at $(x, s)$
in the direction $p$. The stopping conditions are based on the error function (19.10).

Algorithm 19.2 (Line Search Interior-Point Algorithm).

Choose $x_0$ and $s_0 > 0$, and compute initial values for the multipliers $y_0$ and $z_0 > 0$.
If a quasi-Newton approach is used, choose an $n \times n$ symmetric and positive definite initial
matrix $B_0$. Select an initial barrier parameter $\mu > 0$, parameters $\eta, \sigma \in (0, 1)$,
and tolerances $\epsilon_\mu$ and $\epsilon_{TOL}$. Set $k \leftarrow 0$.

repeat until $E(x_k, s_k, y_k, z_k; 0) \le \epsilon_{TOL}$
    repeat until $E(x_k, s_k, y_k, z_k; \mu) \le \epsilon_\mu$
        Compute the primal-dual direction $p = (p_x, p_s, p_y, p_z)$ from
            (19.12), where the coefficient matrix is modified as in
            (19.25), if necessary;
        Compute $\alpha_s^{\max}$, $\alpha_z^{\max}$ using (19.9); Set $p_w = (p_x, p_s)$;
        Compute step lengths $\alpha_s$, $\alpha_z$ satisfying both (19.27) and
            $\phi_\nu(x_k + \alpha_s p_x, s_k + \alpha_s p_s) \le \phi_\nu(x_k, s_k) + \eta \alpha_s D\phi_\nu(x_k, s_k; p_w)$;
        Compute $(x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1})$ using (19.28);
        if a quasi-Newton approach is used
            update the approximation $B_k$;
        Set $k \leftarrow k + 1$;
    end
    Set $\mu \leftarrow \sigma \mu$ and update $\epsilon_\mu$;
end


The barrier tolerance can be defined, for example, as $\epsilon_\mu = \mu$, as in Algorithm 19.1.
An adaptive strategy that updates the barrier parameter $\mu$ at every step is easily implemented
in this framework. If the merit function can cause the Maratos effect (see Section 15.4),
a second-order correction or a nonmonotone strategy should be implemented. An alternative
to using a merit function is to employ a filter mechanism to perform the line search.

We will see in Section 19.7 that Algorithm 19.2 must be safeguarded to ensure global
convergence.
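The maximum step lengths used in Algorithm 19.2 come from the fraction to the boundary rule. A minimal sketch follows, assuming the rule has the usual form "largest $\alpha \in (0, 1]$ with $v + \alpha p \ge (1 - \tau) v$" for a positive vector $v$; the function name `max_step_to_boundary` and the sample data are hypothetical.

```python
def max_step_to_boundary(v, p, tau=0.995):
    """Largest alpha in (0, 1] such that v + alpha*p >= (1 - tau)*v, assuming v > 0."""
    alpha = 1.0
    for vi, pi in zip(v, p):
        if pi < 0.0:
            # Component i allows at most alpha = tau * vi / |pi|.
            alpha = min(alpha, -tau * vi / pi)
    return alpha


# Slacks: the first component limits the step.
alpha_s_max = max_step_to_boundary([1.0, 0.5], [-2.0, 0.25])
# Multipliers: no component decreases, so the full step is allowed.
alpha_z_max = max_step_to_boundary([2.0, 1.0], [1.0, 0.5])
```

Only components that decrease along the step restrict the step length, which is why the loop skips entries with nonnegative $p_i$.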

19.5 A TRUST-REGION INTERIOR-POINT METHOD


We  now  consider  an  interior-point  method  that  uses  trust  regions  to  promote  convergence. 
As  in  the  unconstrained  case,  the  trust-region  formulation  allows  great  freedom  in  the 
choice  of  the  Hessian  and  provides  a  mechanism  for  coping  with  Jacobian  and  Hessian 
singularities.  The  price  to  pay  for  this  flexibility  is  a  more  complex  iteration  than  in  the  line 
search  approach. 

The  interior-point  method  described  below  is  asymptotically  equivalent  to  the  line 
search  method  discussed  in  Section  19.4,  but  differs  significantly  in  two  respects.  First,  it 
is  not  fully  a  primal-dual  method  in  the  sense  that  it  first  computes  a  step  in  the  variables 
(x,  s)  and  then  updates  the  estimates  for  the  multipliers,  as  opposed  to  the  approach  of 
Algorithm  19.1,  in  which  primal  and  dual  variables  are  computed  simultaneously.  Second, 
the  trust-region  method  uses  a  scaling  of  the  variables  that  discourages  moves  toward  the 
boundary  of  the  feasible  region.  This  causes  the  algorithm  to  generate  steps  that  can  be 
different  from,  and  enjoy  more  favorable  convergence  properties  than,  those  produced  by  a 
line  search  method. 

We  first  describe  a  trust-region  algorithm  for  finding  approximate  solutions  of  a  fixed 
barrier  problem.  We  then  present  a  complete  interior-point  method  in  which  the  barrier 
parameter  is  driven  to  zero. 

AN  ALGORITHM  FOR  SOLVING  THE  BARRIER  PROBLEM 

The  barrier  problem  (19.4)  is  an  equality-constrained  optimization  problem  and 
can  be  solved  by  using  a  sequential  quadratic  programming  method  with  trust  regions. 
A  straightforward  application  of  SQP  techniques  to  the  barrier  problem  leads,  however,  to 
inefficient  steps  that  tend  to  violate  the  positivity  of  the  slack  variables  and  are  frequently  cut 
short  by  the  trust-region  constraint.  To  overcome  this  problem,  we  design  an  SQP  method 
tailored  to  the  structure  of  barrier  problems. 

At the iterate $(x, s)$, and for a given barrier parameter $\mu$, we first compute Lagrange
multiplier estimates $(y, z)$ and then compute a step $p = (p_x, p_s)$ that approximately solves
the subproblem

$$\min_{p_x, p_s} \; \nabla f^T p_x + \tfrac{1}{2} p_x^T \nabla_{xx}^2 \mathcal{L}\, p_x - \mu e^T S^{-1} p_s + \tfrac{1}{2} p_s^T \Sigma\, p_s \tag{19.31a}$$

$$\text{subject to} \quad A_E(x) p_x + c_E(x) = r_E, \tag{19.31b}$$

$$A_I(x) p_x - p_s + (c_I(x) - s) = r_I, \tag{19.31c}$$

$$\|(p_x, S^{-1} p_s)\|_2 \le \Delta, \tag{19.31d}$$

$$p_s \ge -\tau s. \tag{19.31e}$$
Here $\Sigma$ is the primal-dual matrix (19.13), and the scalar $\tau \in (0, 1)$ is chosen close
to 1 (for example, 0.995). The inequality (19.31e) plays the same role as the fraction to the
boundary rule (19.9). Ideally, we would like to set $r = (r_E, r_I) = 0$, but since this can cause
the constraints (19.31b)-(19.31d) to be incompatible or to give a step $p$ that makes little
progress toward feasibility, we choose the parameter $r$ by an auxiliary computation, as in
Algorithm 18.4.

We motivate the choice of the objective (19.31a) by noting that the first-order optimality
conditions of (19.31a)-(19.31c) are given by (19.2) (with the second block of equations
scaled by $S^{-1}$). Thus the step computed from the subproblem (19.31) is related to the
primal-dual line search step in the same way as the SQP and Newton-Lagrange steps of
Section 18.1.

The trust-region constraint (19.31d) guarantees that the problem (19.31) has a finite
solution even when $\nabla_{xx}^2 \mathcal{L}(x, s, y, z)$ is not positive definite, and therefore
this Hessian need never be modified. In addition, the trust-region formulation ensures that
adequate progress is made at every iteration. To justify the scaling $S^{-1}$ used in (19.31d),
we note that the shape of the trust region must take into account the requirement that the slacks
not approach zero prematurely. The scaling $S^{-1}$ serves this purpose because it restricts those
components $i$ of the step vector $p_s$ for which $s_i$ is close to its lower bound of zero. As we
see below, it also plays an important role in the choice of the relaxation vectors $r_E$ and $r_I$.

We outline this SQP trust-region approach as follows. The stopping condition is
defined in terms of the error function $E$ given by (19.10), and the merit function $\phi_\nu$
can be defined as in (19.26) using the 2-norm, $\|\cdot\|_2$.

Algorithm 19.3 (Trust-Region Algorithm for Barrier Problems).

Input parameters: $\mu > 0$, $x_0$, $s_0 > 0$, $\epsilon_\mu$, and $\Delta_0 > 0$. Compute Lagrange
multiplier estimates $y_0$ and $z_0 > 0$. Set $k \leftarrow 0$.

repeat until $E(x_k, s_k, y_k, z_k; \mu) \le \epsilon_\mu$
    Compute $p = (p_x, p_s)$ by approximately solving (19.31).
    if $p$ provides sufficient decrease in the merit function $\phi_\nu$
        Set $x_{k+1} \leftarrow x_k + p_x$, $s_{k+1} \leftarrow s_k + p_s$;
        Compute new multiplier estimates $y_{k+1}$, $z_{k+1} > 0$
            and set $\Delta_{k+1} \ge \Delta_k$;
    else
        Define $x_{k+1} \leftarrow x_k$, $s_{k+1} \leftarrow s_k$, and set $\Delta_{k+1} < \Delta_k$;
    end
    Set $k \leftarrow k + 1$;
end (repeat)

Algorithm 19.3 is applied for a fixed value of the barrier parameter $\mu$. A complete
interior-point algorithm driven by a sequence $\{\mu_k\} \to 0$ is described below. First, we
discuss how to find an approximate solution of the subproblem (19.31), along with Lagrange
multiplier estimates $(y_{k+1}, z_{k+1})$.
STEP COMPUTATION

The subproblem (19.31a)-(19.31e) is difficult to minimize exactly because of the
presence of the nonlinear constraint (19.31d) and the bounds (19.31e). An important
observation is that we can compute useful inexact solutions at moderate cost. Since this
approach scales up well with the number of variables and constraints, it provides a framework
for developing practical interior-point methods for large-scale optimization.

The first step in the solution process is to make a change of variables that transforms
the trust-region constraint (19.31d) into a ball. By defining

$$\tilde{p} = \begin{bmatrix} \tilde{p}_x \\ \tilde{p}_s \end{bmatrix} = \begin{bmatrix} p_x \\ S^{-1} p_s \end{bmatrix}, \tag{19.32}$$


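we can write problem (19.31) as

$$\min_{\tilde{p}_x, \tilde{p}_s} \; \nabla f^T \tilde{p}_x + \tfrac{1}{2} \tilde{p}_x^T \nabla_{xx}^2 \mathcal{L}\, \tilde{p}_x - \mu e^T \tilde{p}_s + \tfrac{1}{2} \tilde{p}_s^T S \Sigma S\, \tilde{p}_s \tag{19.33a}$$

$$\text{subject to} \quad A_E(x) \tilde{p}_x + c_E(x) = r_E, \tag{19.33b}$$

$$A_I(x) \tilde{p}_x - S \tilde{p}_s + (c_I(x) - s) = r_I, \tag{19.33c}$$

$$\|(\tilde{p}_x, \tilde{p}_s)\|_2 \le \Delta, \tag{19.33d}$$

$$\tilde{p}_s \ge -\tau e. \tag{19.33e}$$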

To compute the vectors $r_E$ and $r_I$ we proceed as in Section 18.5 and formulate the following
normal subproblem in the variable $v = (v_x, v_s)$:

$$\min_{v} \; \|A_E(x) v_x + c_E(x)\|_2^2 + \|A_I(x) v_x - S v_s + (c_I(x) - s)\|_2^2 \tag{19.34a}$$

$$\text{subject to} \quad \|(v_x, v_s)\|_2 \le 0.8 \Delta, \tag{19.34b}$$

$$v_s \ge -(\tau/2) e. \tag{19.34c}$$

If  we  ignore  (19.34c),  this  problem  has  the  standard  form  of  a  trust-region  problem,  and  we 
can  compute  an  approximate  solution  by  using  the  techniques  discussed  in  Chapter  4,  such 
as  the  dogleg  method.  If  the  solution  violates  the  bounds  (19.34c),  we  can  backtrack  so  that 
these  bounds  are  satisfied. 
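One simple way to backtrack so that the bounds (19.34c) hold is to scale the whole normal step by a factor $\theta \in (0, 1]$. The sketch below assumes this particular choice (the text does not prescribe how to backtrack), and the name `backtrack_normal_step` is hypothetical. Scaling both components can only shrink the norm, so the trust-region bound (19.34b) remains satisfied.

```python
def backtrack_normal_step(v_x, v_s, tau=0.995):
    """Scale (v_x, v_s) by theta in (0, 1] so that theta * v_s >= -(tau/2) componentwise."""
    bound = -tau / 2.0
    theta = 1.0
    for vi in v_s:
        if vi < bound:
            theta = min(theta, bound / vi)
    return [theta * a for a in v_x], [theta * a for a in v_s], theta


# Illustrative data (assumed): the first slack component violates the bound.
v_x_new, v_s_new, theta = backtrack_normal_step([0.3], [-1.0, 0.2])
```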

Having solved (19.34), we define the vectors $r_E$ and $r_I$ in (19.33b)-(19.33c) to be the
residuals in the normal step computation, namely,

$$r_E = A_E(x) v_x + c_E(x), \qquad r_I = A_I(x) v_x - S v_s + (c_I(x) - s). \tag{19.35}$$

We are now ready to compute an approximate solution $\tilde{p}$ of the subproblem (19.33). By
(19.35), the vector $v$ is a particular solution of the linear constraints (19.33b)-(19.33c). We
can then solve the equality-constrained quadratic program (19.33a)-(19.33c) by using the
projected conjugate gradient iteration given in Algorithm 16.2. We terminate the projected
CG iteration by Steihaug's rules: During the solution by CG we monitor the satisfaction
of the trust-region constraint (19.33d) and stop if the boundary of this region is reached,
if negative curvature is detected, or if an approximate solution is obtained. If the solution
given by the projected CG iteration does not satisfy the bounds (19.33e), we backtrack so
that they are satisfied. After the step $(\tilde{p}_x, \tilde{p}_s)$ has been computed, we recover
$p$ from (19.32).

As discussed in Section 16.3, every step of the projected CG iteration requires the
solution of a linear system in order to perform the projection operation. For the quadratic
program (19.33a)-(19.33c) this projection matrix is given by

$$\begin{bmatrix} I & \hat{A}^T \\ \hat{A} & 0 \end{bmatrix}, \quad \text{with} \quad \hat{A} = \begin{bmatrix} A_E(x) & 0 \\ A_I(x) & -S \end{bmatrix}. \tag{19.36}$$

Thus, although this trust-region approach still requires the solution of an augmented system,
the matrix (19.36) is simpler than the primal-dual matrix (19.12). In particular, the Hessian
$\nabla_{xx}^2 \mathcal{L}$ need never be factored because the CG approach requires only products
of this matrix with vectors.

We mentioned in Section 19.3 that the term $S \Sigma S$ in (19.33a) has a much tighter
distribution of eigenvalues than $\Sigma$. Therefore the CG method will normally not be adversely
affected by ill conditioning and is a viable approach for solving the quadratic program
(19.33a)-(19.33c).

LAGRANGE  MULTIPLIERS  ESTIMATES  AND  STEP  ACCEPTANCE 

At an iterate $(x, s)$, we choose $(y, z)$ to be the least-squares multipliers (see (18.21))
corresponding to (19.33a)-(19.33c). We obtain the formula

$$\begin{bmatrix} y \\ z \end{bmatrix} = \left( \hat{A} \hat{A}^T \right)^{-1} \hat{A} \begin{bmatrix} \nabla f(x) \\ -\mu e \end{bmatrix}, \tag{19.37}$$

where $\hat{A}$ is given by (19.36). The multiplier estimates $z$ obtained in this manner may not
always be positive; to enforce positivity, we may redefine them as

$$z_i \leftarrow \min(10^{-3}, \mu / s_i), \quad i = 1, 2, \ldots, m. \tag{19.38}$$

The quantity $\mu / s_i$ is called the $i$th primal multiplier estimate because if all components
of $z$ were defined by (19.38), then $\Sigma$ would reduce to the primal choice, (19.14).
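The safeguard (19.38) is a componentwise operation. A minimal sketch follows, assuming it is applied only to those components of $z$ that are not positive (the text leaves this choice open); the function name `safeguard_multipliers` and the sample data are hypothetical.

```python
def safeguard_multipliers(z, s, mu, cap=1e-3):
    """Replace each nonpositive z_i by min(cap, mu / s_i); positive entries are kept."""
    return [zi if zi > 0.0 else min(cap, mu / si) for zi, si in zip(z, s)]


# Illustrative data (assumed): the second and third estimates are not positive.
z_safe = safeguard_multipliers([0.7, -0.2, 0.0], [1.0, 0.5, 200.0], mu=0.1)
```

For the second component the cap $10^{-3}$ is active, while for the third the primal estimate $\mu / s_i$ is the smaller value.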

As is standard in trust-region methods, the step $p$ is accepted if

$$\text{ared}(p) \ge \eta\, \text{pred}(p), \tag{19.39}$$

where

$$\text{ared}(p) = \phi_\nu(x, s) - \phi_\nu(x + p_x, s + p_s) \tag{19.40}$$

and where $\eta$ is a constant in $(0, 1)$ (say, $\eta = 10^{-8}$). The predicted reduction is
defined as

$$\text{pred}(p) = q_\nu(0) - q_\nu(p), \tag{19.41}$$

where $q_\nu$ is defined as

$$q_\nu(p) = \nabla f^T p_x + \tfrac{1}{2} p_x^T \nabla_{xx}^2 \mathcal{L}\, p_x - \mu e^T S^{-1} p_s + \tfrac{1}{2} p_s^T \Sigma\, p_s + \nu\, m(p),$$

and

$$m(p) = \left\| \begin{bmatrix} A_E(x) p_x + c_E(x) \\ A_I(x) p_x - p_s + c_I(x) - s \end{bmatrix} \right\|_2.$$

To determine an appropriate value of the penalty parameter $\nu$, we require that $\nu$ be
large enough that

$$\text{pred}(p) \ge \rho \nu \big( m(0) - m(p) \big), \tag{19.42}$$

for some parameter $\rho \in (0, 1)$. This is the same as condition (18.35) used in Section 18.5,
and the value of $\nu$ can be computed by the procedure described in that section.
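Conditions (19.39) and (19.42) involve only a few scalars once the step is known. The sketch below writes $\text{pred}(p) = -q + \nu\,(m(0) - m(p))$, where $q$ denotes the part of $q_\nu(p)$ that does not involve $\nu$, and raises $\nu$ just enough for (19.42) to hold. The names `update_penalty` and `accept_step`, the default values of `rho` and `margin`, and the sample numbers are assumptions for illustration, not the procedure of Section 18.5.

```python
def update_penalty(nu, q_model, m0, mp, rho=0.3, margin=1.0):
    """Return a penalty parameter nu' >= nu with pred >= rho * nu' * (m0 - mp).

    q_model is the nu-independent part of q_nu(p), so pred = -q_model + nu * (m0 - mp).
    """
    vpred = m0 - mp
    if -q_model + nu * vpred >= rho * nu * vpred:
        return nu  # the current value is already large enough
    if vpred <= 0.0:
        raise ValueError("step does not reduce the linearized infeasibility")
    # Smallest admissible value is q_model / ((1 - rho) * vpred); add a safety margin.
    return q_model / ((1.0 - rho) * vpred) + margin


def accept_step(ared, pred, eta=1e-8):
    """Acceptance test (19.39)."""
    return ared >= eta * pred


# Illustrative numbers (assumed).
nu_new = update_penalty(nu=1.0, q_model=2.0, m0=1.0, mp=0.5)
nu_same = update_penalty(nu=1.0, q_model=-1.0, m0=1.0, mp=0.5)
pred_new = -2.0 + nu_new * 0.5
```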

DESCRIPTION  OF  A  TRUST-REGION  INTERIOR-POINT  METHOD 

We now present a more detailed description of the trust-region interior-point algorithm
for solving the nonlinear programming problem (19.1). For concreteness we follow
the Fiacco-McCormick strategy for updating the barrier parameter. The stopping conditions
are stated, once more, in terms of the error function $E$ defined by (19.10). In a quasi-Newton
approach, the Hessian is replaced by a symmetric approximation.

Algorithm 19.4 (Trust-Region Interior-Point Algorithm).

Choose a value for the parameters $\eta > 0$, $\tau \in (0, 1)$, $\sigma \in (0, 1)$, and
$\zeta \in (0, 1)$, and select the stopping tolerances $\epsilon_\mu$ and $\epsilon_{TOL}$. If a
quasi-Newton approach is used, select an $n \times n$ symmetric initial matrix $B_0$. Choose
initial values for $\mu > 0$, $x_0$, $s_0 > 0$, and $\Delta_0$. Set $k \leftarrow 0$.

repeat until $E(x_k, s_k, y_k, z_k; 0) \le \epsilon_{TOL}$
    repeat until $E(x_k, s_k, y_k, z_k; \mu) \le \epsilon_\mu$
        Compute Lagrange multipliers from (19.37)-(19.38);
        Compute $\nabla_{xx}^2 \mathcal{L}(x_k, s_k, y_k, z_k)$ or update a quasi-Newton
            approximation $B_k$, and define $\Sigma_k$ by (19.13);
        Compute the normal step $v_k = (v_x, v_s)$;
        Compute $\tilde{p}_k$ by applying the projected CG method to (19.33);
        Obtain the total step $p_k$ from (19.32);
        Update $\nu_k$ to satisfy (19.42);
        Compute $\text{pred}_k(p_k)$ by (19.41) and $\text{ared}_k(p_k)$ by (19.40);
        if $\text{ared}_k(p_k) \ge \eta\, \text{pred}_k(p_k)$
            Set $x_{k+1} \leftarrow x_k + p_x$, $s_{k+1} \leftarrow s_k + p_s$;
            Choose $\Delta_{k+1} \ge \Delta_k$;
        else
            Set $x_{k+1} = x_k$, $s_{k+1} = s_k$, and choose $\Delta_{k+1} < \Delta_k$;
        endif
        Set $k \leftarrow k + 1$;
    end
    Set $\mu \leftarrow \sigma \mu$ and update $\epsilon_\mu$;
end

The  merit  function  (19.26)  can  reject  steps  that  make  good  progress  toward  a  solution: 
the  Maratos  effect  discussed  in  Chapter  18.  This  deficiency  can  be  overcome  by  selective 
application  of  a  second-order  correction  step;  see  Section  15.4. 

Algorithm 19.4 can easily be modified to implement an adaptive barrier update
strategy. The barrier stop tolerance can be defined as $\epsilon_\mu = \mu$. Algorithm 19.4 is
the basis of the KNITRO/CG method [50], which implements both exact Hessian and quasi-Newton
options.
19.6 THE PRIMAL LOG-BARRIER METHOD


Prior to the introduction of primal-dual interior methods, barrier methods worked in the
space of primal variables $x$. As in the quadratic penalty function approach of Chapter 17,
the goal was to solve nonlinear programming problems by unconstrained minimization
applied to a parametric sequence of functions.

Primal barrier methods are more easily described in the context of inequality-constrained
problems of the form

$$\min_x \; f(x) \quad \text{subject to} \quad c(x) \ge 0. \tag{19.43}$$

The log-barrier function is defined by

$$P(x; \mu) = f(x) - \mu \sum_{i \in I} \log c_i(x), \tag{19.44}$$

where $\mu > 0$. One can show that the minimizers of $P(x; \mu)$, which we denote by $x(\mu)$,
approach a solution of (19.43) as $\mu \downarrow 0$, under certain conditions; see, for example,
[111]. The trajectory $C_p$ defined by

$$C_p = \{ x(\mu) \mid \mu > 0 \} \tag{19.45}$$

is often referred to as the primal central path.

Since the minimizer $x(\mu)$ of $P(x; \mu)$ lies in the strictly feasible set $\{x \mid c(x) > 0\}$
(where no constraints are active), we can in principle search for it by using any of the
unconstrained minimization algorithms described in the first part of this book. These methods
need to be modified, as explained in the discussion following equation (19.30), so that they
reject steps that leave the feasible region or are too close to the constraint boundaries.

One way to obtain an estimate of the Lagrange multipliers is based on differentiating
$P$ to obtain

$$\nabla_x P(x; \mu) = \nabla f(x) - \sum_{i \in I} \frac{\mu}{c_i(x)} \nabla c_i(x). \tag{19.46}$$

When $x$ is close to the minimizer $x(\mu)$ and $\mu$ is small, we see from Theorem 12.1 that the
optimal Lagrange multipliers $z_i^*$, $i \in I$, can be estimated as follows:

$$z_i^* \approx \mu / c_i(x), \quad i \in I. \tag{19.47}$$

A  general  framework  for  algorithms  based  on  the  primal  log-barrier  function  (19.44) 
can  be  specified  as  follows. 

Framework 19.5 (Unconstrained Primal Barrier Method).

Given $\mu_0 > 0$, a sequence $\{\tau_k\}$ with $\tau_k \to 0$, and a starting point $x_0^s$;
for $k = 0, 1, 2, \ldots$
    Find an approximate minimizer $x_k$ of $P(\cdot\,; \mu_k)$, starting at $x_k^s$,
        and terminating when $\|\nabla P(x_k; \mu_k)\| \le \tau_k$;
    Compute Lagrange multipliers $z_k$ by (19.47);
    if final convergence test satisfied
        stop with approximate solution $x_k$;
    Choose new penalty parameter $\mu_{k+1} < \mu_k$;
    Choose new starting point $x_{k+1}^s$;
end (for)

The primal barrier approach was first proposed by Frisch [115] in the 1950s and was
analyzed and popularized by Fiacco and McCormick [98] in the late 1960s. It fell out of
favor after the introduction of SQP methods and has not regained its popularity because it
suffers from several drawbacks compared to primal-dual interior-point methods. The most
important drawback is that the minimizer $x(\mu)$ becomes more and more difficult to find as
$\mu \downarrow 0$ because of the nonlinearity of the function $P(x; \mu)$.


□ Example 19.1

Consider the problem

$$\min \; (x_1 + 0.5)^2 + (x_2 - 0.5)^2 \quad \text{subject to} \quad x_1 \in [0, 1], \; x_2 \in [0, 1], \tag{19.48}$$

for which the primal barrier function is

$$P(x; \mu) = (x_1 + 0.5)^2 + (x_2 - 0.5)^2 - \mu \left[ \log x_1 + \log(1 - x_1) + \log x_2 + \log(1 - x_2) \right]. \tag{19.49}$$

Contours of this function for the value $\mu = 0.01$ are plotted in Figure 19.1. The
elongated nature of the contours indicates bad scaling, which causes poor performance
of unconstrained optimization methods such as quasi-Newton, steepest descent, and
conjugate gradient. Newton's method is insensitive to the poor scaling, but the nonelliptical
property (the contours in Figure 19.1 are almost straight along the left edge while being
circular along the right edge) indicates that the quadratic approximation on which Newton's
method is based does not capture well the behavior of the barrier function. Hence,
Newton's method, too, may not show rapid convergence to the minimizer of (19.49) except
in a small neighborhood of this point.
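Framework 19.5 can be tried out on problem (19.48). The following is a minimal sketch under several assumptions that are not part of the text: the inner solver is a damped Newton iteration with an Armijo test that also rejects infeasible trial points, the tolerance is $\tau_k = \mu_k$, the barrier parameter is divided by 10 in each outer iteration, and each inner solve is started from the previous minimizer. Because the problem is separable, the Hessian of $P$ is diagonal.

```python
import math


def barrier_P(x, mu):
    x1, x2 = x
    return ((x1 + 0.5) ** 2 + (x2 - 0.5) ** 2
            - mu * (math.log(x1) + math.log(1 - x1) + math.log(x2) + math.log(1 - x2)))


def grad_P(x, mu):
    return [2 * (x[0] + 0.5) - mu / x[0] + mu / (1 - x[0]),
            2 * (x[1] - 0.5) - mu / x[1] + mu / (1 - x[1])]


def hess_diag_P(x, mu):
    # Diagonal of the Hessian of P; off-diagonal entries are zero for this problem.
    return [2 + mu / xi ** 2 + mu / (1 - xi) ** 2 for xi in x]


def minimize_barrier(x, mu, tol, max_iter=100):
    """Damped Newton iteration on P(.; mu) that never leaves the open unit square."""
    for _ in range(max_iter):
        g = grad_P(x, mu)
        if math.hypot(g[0], g[1]) <= tol:
            break
        p = [-gi / hi for gi, hi in zip(g, hess_diag_P(x, mu))]
        slope = sum(gi * pi for gi, pi in zip(g, p))
        alpha, P0 = 1.0, barrier_P(x, mu)
        for _ in range(60):
            trial = [xi + alpha * pi for xi, pi in zip(x, p)]
            feasible = all(0.0 < t < 1.0 for t in trial)
            if feasible and barrier_P(trial, mu) <= P0 + 1e-4 * alpha * slope:
                x = trial
                break
            alpha *= 0.5
        else:
            break  # no acceptable step length found
    return x


x, mu = [0.5, 0.5], 1.0
for k in range(5):
    x = minimize_barrier(x, mu, tol=mu)
    # Multiplier estimates (19.47) for x1 >= 0, 1 - x1 >= 0, x2 >= 0, 1 - x2 >= 0.
    z = [mu / x[0], mu / (1 - x[0]), mu / x[1], mu / (1 - x[1])]
    mu *= 0.1
```

The solution of (19.48) is $(0, 0.5)$, with only the bound $x_1 \ge 0$ active. In this sketch the first component of `x` tracks the barrier parameter toward zero, and the first entry of `z` approaches the multiplier of the active bound, whose value is 1.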


Figure 19.1 Contours of $P(x; \mu)$ from (19.49) for $\mu = 0.01$.
To lessen this nonlinearity, we can proceed as in (17.21) and introduce additional
variables. Defining $z_i = \mu / c_i(x)$, we rewrite the stationarity condition (19.46) as

$$\nabla f(x) - \sum_{i \in I} z_i \nabla c_i(x) = 0, \tag{19.50a}$$

$$C(x) z - \mu e = 0, \tag{19.50b}$$

where $C(x) = \text{diag}(c_1(x), c_2(x), \ldots, c_m(x))$. Note that this system is equivalent to
the perturbed KKT conditions (19.2) for problem (19.43) if, in addition, we introduce slacks as
in (19.2d). Finally, if we apply Newton's method in the variables $(x, s, z)$ and temporarily
ignore the bounds $s, z \ge 0$, we arrive at the primal-dual formulation. Thus, with hindsight,
we can transform the primal log-barrier approach into the primal-dual line search approach
of Section 19.4 or into the trust-region algorithm of Section 19.5.

Other  drawbacks  of  the  classical  primal  barrier  approach  are  that  it  requires  a  feasible 
initial  point,  which  can  be  difficult  to  find  in  many  cases,  and  that  the  incorporation 
of  equality  constraints  in  a  primal  function  is  problematic.  (A  formulation  in  which  the 
equality  constraints  are  replaced  by  quadratic  penalties  suffers  from  the  shortcomings  of 
quadratic  penalty  functions  discussed  in  Section  17.1.) 

The shortcomings of the primal barrier approach were attributed for many years to
the ill conditioning of the Hessian of the barrier function $P$. Note that

$$\nabla_{xx}^2 P(x; \mu) = \nabla^2 f(x) - \sum_{i \in I} \frac{\mu}{c_i(x)} \nabla^2 c_i(x) + \sum_{i \in I} \frac{\mu}{c_i^2(x)} \nabla c_i(x) \nabla c_i(x)^T. \tag{19.51}$$

By substituting (19.47) into (19.51) and using the definition (12.33) of the Lagrangian
$\mathcal{L}(x, z)$, we find that

$$\nabla_{xx}^2 P(x; \mu) \approx \nabla_{xx}^2 \mathcal{L}(x, z^*) + \sum_{i \in I} \frac{1}{\mu} (z_i^*)^2 \nabla c_i(x) \nabla c_i(x)^T. \tag{19.52}$$

Note the similarity of this expression to the Hessian of the quadratic penalty function (17.19).
Analysis of the matrix $\nabla_{xx}^2 P(x; \mu)$ shows that it becomes increasingly ill conditioned
near the minimizer $x(\mu)$, as $\mu$ approaches zero.

This ill conditioning will be detrimental to the performance of the steepest descent,
conjugate gradient, or quasi-Newton methods. It is therefore correct to identify ill
conditioning as a source of the difficulties of unconstrained primal barrier functions that use
these unconstrained methods. Newton's method is, however, not affected by ill conditioning, but
its performance is still not satisfactory. As explained above, it is the high nonlinearity of the
primal barrier function $P$ that poses significant difficulties to Newton's method.
19.7 GLOBAL CONVERGENCE PROPERTIES


We now study some global convergence properties of the primal-dual interior-point methods
described in Sections 19.4 and 19.5. Theorem 19.1 provides the starting point for the analysis.
It gives conditions under which limit points of the iterates generated by the interior-point
methods are KKT points for the nonlinear problem. Theorem 19.1 relies on the assumption
that the perturbed KKT conditions (19.2) can be satisfied (to a certain accuracy) for every
value of $\mu_k$. In this section we study conditions under which this assumption holds, that
is, conditions that guarantee that our algorithms can find stationary points of the barrier
problem (19.4).

We begin with a surprising observation. Whereas the line search primal-dual approach
is the basis of globally convergent interior-point algorithms for linear and quadratic
programming, it is not guaranteed to be successful for nonlinear programming, even for
nondegenerate problems.

FAILURE  OF  THE  LINE  SEARCH  APPROACH 

We have seen in Chapter 11 that line search Newton iterations for nonlinear equations
can fail when the Jacobian loses rank. We now discuss a different kind of failure specific to
interior-point methods. It is caused by the lack of coordination between the step computation
and the imposition of the bounds.


□ Example 19.2 (Wächter and Biegler [299])

Consider the problem

$$\min \; x \tag{19.53a}$$

$$\text{subject to} \quad c_1(x) - s_1 \equiv x^2 - s_1 - 1 = 0, \tag{19.53b}$$

$$c_2(x) - s_2 \equiv x - s_2 - \tfrac{1}{2} = 0, \tag{19.53c}$$

$$s_1 \ge 0, \quad s_2 \ge 0. \tag{19.53d}$$

Note that the Jacobian of the equality constraints (19.53b)-(19.53c) with respect to $(x, s)$
has full rank everywhere. Let us apply a line search interior-point method of the form
(19.6)-(19.9), starting from an initial point $x^{(0)}$ such that $(s_1^{(0)}, s_2^{(0)}) > 0$ and
$c_1(x^{(0)}) - s_1^{(0)} \ge 0$. (In this example, we use superscripts to denote iteration
indices.) Figure 19.2 illustrates the feasible region (the dotted segment of the parabola) and
the initial point, all projected onto the $x$-$s_1$ plane. The primal-dual step, which satisfies
the linearization of the constraints (19.53b)-(19.53c), leads from $x^{(0)}$ to the tangent to the
parabola. Here $p_1$ and $p_2$ are examples of possible steps satisfying the linearization of
(19.53b)-(19.53c). The new iterate $x^{(1)}$ therefore lies between $x^{(0)}$ and this tangent, but
since $s_1$ must remain positive, $x^{(1)}$ will lie above the horizontal axis. Thus, from any
starting point above the $x$-axis and to the left of the parabola, namely, in the region

$$\{ (x, s_1, s_2) : x^2 - s_1 - 1 \ge 0, \; s_1 \ge 0 \}, \tag{19.54}$$

the new iterate will remain in this region. The argument can now be repeated to show that
the iterates $\{x^{(k)}\}$ never leave the region (19.54) and therefore never become feasible.

This convergence failure affects any method that generates directions that satisfy the
linearization of the constraints (19.53b)-(19.53c) and that enforces the bounds (19.53d) by
the fraction to the boundary rule (19.8). The merit function can only restrict the step length
further and is therefore incapable of resolving the difficulties. The strategy for updating $\mu$
is also irrelevant because the argument given above makes use only of the linearizations of
the constraints.
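The argument of Example 19.2 can be checked numerically. The sketch below does not compute primal-dual steps. Instead, as an assumption made to keep it short, it draws an arbitrary $x$-component for each step, chooses the slack components so that the linearized constraints hold exactly, and cuts the step by a fraction to the boundary rule with $\tau = 0.995$. If $r_1 = x^2 - s_1 - 1$ denotes the residual of (19.53b), a step of length $\alpha$ of this kind gives the new residual $(1 - \alpha) r_1 + \alpha^2 p_x^2 \ge 0$, so the iterates stay in the region (19.54). All function names are hypothetical.

```python
import random


def wb_residuals(x, s1, s2):
    """Residuals of (19.53b) and (19.53c)."""
    return x * x - s1 - 1.0, x - s2 - 0.5


def linearized_step(x, s1, s2, p_x):
    """Given any p_x, return (p_s1, p_s2) satisfying the linearized constraints."""
    r1, r2 = wb_residuals(x, s1, s2)
    return 2.0 * x * p_x + r1, p_x + r2


def fraction_to_boundary(s, p, tau=0.995):
    alpha = 1.0
    for si, pi in zip(s, p):
        if pi < 0.0:
            alpha = min(alpha, -tau * si / pi)
    return alpha


random.seed(0)
x, s1, s2 = -2.0, 1.0, 1.0  # r1 = 2 >= 0 and both slacks are positive
r1_history = []
for _ in range(30):
    p_x = random.uniform(-1.0, 1.0)  # arbitrary x-component of the step
    p_s1, p_s2 = linearized_step(x, s1, s2, p_x)
    alpha = fraction_to_boundary([s1, s2], [p_s1, p_s2])
    x, s1, s2 = x + alpha * p_x, s1 + alpha * p_s1, s2 + alpha * p_s2
    r1_history.append(wb_residuals(x, s1, s2)[0])
```

Throughout the run $x$ stays below $-1$, whereas feasibility for (19.53c) would require $x = s_2 + \tfrac{1}{2} > 0$, so no iterate becomes feasible.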


These difficulties can be observed when practical line search codes are applied to the
problem (19.53). For a wide range of starting points in the region (19.54), the interior-point
iteration converges to points of the form $(-\beta, 0, 0)$, with $\beta > 0$. In other words,
the iterates can converge to an infeasible, non-optimal point on the boundary of the
set $\{(x, s_1, s_2) : s_1 \ge 0, \; s_2 \ge 0\}$, a situation that barrier methods are supposed to
prevent. Furthermore, such limit points are not stationary for a feasibility measure (see
Definition 17.1).

Failures of this type are rare in practice, but they highlight a theoretical deficiency of
the algorithmic class (19.6)-(19.9) that may manifest itself more often as inefficient behavior
than as outright convergence failure.


MODIFIED  LINE  SEARCH  METHODS 

To  remedy  this  problem,  as  well  as  the  inefficiencies  caused  by  Hessian  and  constraint 
Jacobian  singularities,  we  must  modify  the  search  direction  of  the  line  search  interior-point 
iteration  in  some  circumstances.  One  option  is  to  use  penalizations  of  the  constraints 
[147].  Such  penalty-barrier  methods  have  been  investigated  only  recently  and  mature 
implementations  have  not  yet  emerged. 

An approach that has been successful in practice is to monitor the step lengths $\alpha_s$,
$\alpha_z$ in (19.28); if they are smaller than a given threshold, then we replace the primal-dual
step by a step that guarantees progress in feasibility and, preferably, improvement in optimality,
too. In a filter method, when the step lengths are very small, we can invoke the feasibility
restoration phase (see Section 15.4), which is designed to generate a new iterate that reduces
the infeasibility. A different approach, which assumes that a trust-region algorithm is at
hand, is to replace the primal-dual step by a trust-region step, such as that produced by
Algorithm 19.4.

Safeguarding the primal-dual step when the step lengths are very small is justified
theoretically because, when line search iterations converge to non-stationary points, the
step lengths $\alpha_s$, $\alpha_z$ converge to zero. From a practical perspective, however, this
strategy is not totally satisfactory because it attempts to react when bad steps are generated,
rather than trying to prevent them. It also requires the choice of a heuristic to determine when
a step length is too small. As we discuss next, the trust-region approach always generates
productive steps and needs no safeguarding.


GLOBAL  CONVERGENCE  OF  THE  TRUST-REGION  APPROACH 

The interior-point trust-region method specified in Algorithm 19.4 has favorable
global convergence properties, which we now discuss. For simplicity, we present the analysis
in the context of inequality-constrained problems of the form (19.43). We first study the
solution of the barrier problem (19.4) for a fixed value of $\mu$, and then consider the complete
algorithm.

In the result that follows, $B_k$ denotes the Hessian or a quasi-Newton approximation
to it. We use the measure of infeasibility $h(x) = \|[c(x)]^-\|$, where $[y]^- = \max\{0, -y\}$.
This measure vanishes if and only if $x$ is feasible for problem (19.43). Note that $h(x)^2$ is
differentiable and its gradient is

$$\nabla [h(x)^2] = 2A(x)c(x)^-.$$

We say that a sequence $\{x_k\}$ is asymptotically feasible if $c(x_k)^- \to 0$. To apply Algorithm 19.4
to a fixed barrier problem, we dispense with the outer “repeat” loop.

Theorem  19.2. 

Suppose that Algorithm 19.4 is applied to the barrier problem (19.4), that is, $\mu$ is fixed
and the inner “repeat” loop is executed with $\epsilon_\mu = 0$. Suppose that the sequence $\{f_k\}$ is bounded
below and the sequences $\{\nabla f_k\}$, $\{c_k\}$, $\{A_k\}$, and $\{B_k\}$ are bounded. Then one of the following
three situations occurs:

(i) The sequence $\{x_k\}$ is not asymptotically feasible. In this case, the iterates approach
stationarity of the measure of infeasibility $h(x) = \|c(x)^-\|$, meaning that $A_k c_k^- \to 0$, and
the penalty parameters $\nu_k$ tend to infinity.

(ii) The sequence $\{x_k\}$ is asymptotically feasible, but the sequence $\{(c_k, A_k)\}$ has a limit point
$(\bar{c}, \bar{A})$ failing the linear independence constraint qualification. In this situation also, the
penalty parameters $\nu_k$ tend to infinity.

(iii) The sequence $\{x_k\}$ is asymptotically feasible, and all limit points of the sequence $\{(c_k, A_k)\}$
satisfy the linear independence constraint qualification. In this case, the penalty parameter
$\nu_k$ is constant and $c_k > 0$ for all large indices $k$, and the stationarity conditions of problem
(19.4) are satisfied in the limit.

This theorem is proved in [48], where it is assumed, for simplicity, that $\Sigma$ is given
by the primal choice (19.14). The theorem accounts for two situations in which the KKT
conditions  may  not  be  satisfied  in  the  limit,  both  of  which  are  of  interest.  Outcome  (i)  is 
a  case  in  which,  in  the  limit,  there  is  no  direction  that  improves  feasibility  to  first  order. 
This  outcome  cannot  be  ruled  out  because  finding  a  feasible  point  is  a  problem  that  a  local 
method  cannot  always  solve  without  a  good  starting  point.  (Note  that  we  do  not  assume 
that  the  constraint  Jacobian  Ak  has  full  rank.) 

In  considering  outcome  (ii),  we  must  keep  in  mind  that  in  some  cases  the  solution  to 
problem  (19.43)  is  a  point  where  the  linear  independence  constraint  qualification  fails  and 
that  is  not  a  KKT  point.  Outcome  (iii)  is  the  most  desirable  outcome  and  can  be  monitored 
in practice by observing, for example, the behavior of the penalty parameter $\nu_k$.

We  now  study  the  complete  interior-point  method  given  in  Algorithm  19.4  applied 
to  the  nonlinear  programming  problem  (19.43).  By  combining  Theorems  19.1  and  19.2  we 
see  that  the  following  outcomes  can  occur: 

• For some barrier parameter $\mu$ generated by the algorithm, either the inequality
$\|c_k - s_k\| \le \epsilon_\mu$ is never satisfied, in which case the stationarity condition for minimizing
$h(x)$ is satisfied in the limit, or else $(c_k - s_k) \to 0$, in which case the sequence $\{(c_k, A_k)\}$
has a limit point $(\bar{c}, \bar{A})$ failing the linear independence constraint qualification;

• At each outer iteration of Algorithm 19.4 the inner stop test $E(x_k, s_k, y_k, z_k; \mu) \le \epsilon_\mu$
is satisfied. Then all limit points of the iteration sequence are feasible. Furthermore,
if any limit point $\hat{x}$ satisfies the linear independence constraint qualification, the
first-order necessary conditions for problem (19.43) hold at $\hat{x}$.


19.8  SUPERLINEAR  CONVERGENCE 


We can implement primal-dual interior-point methods so that they converge quickly near
the solution. All that is needed is that we carefully control the decrease in the barrier parameter
$\mu$ and the inner convergence tolerance $\epsilon_\mu$, and let the parameter $\tau$ in (19.9) converge to 1
sufficiently  rapidly.  We  now  describe  strategies  for  updating  these  parameters  in  the  context 
of  the  line  search  iteration  discussed  in  Section  19.4;  these  strategies  extend  easily  to  the 
trust-region  method  of  Section  19.5. 

In  the  discussion  that  follows,  we  assume  that  the  merit  function  or  filter  is  inactive. 
This  assumption  is  realistic  because  with  a  careful  implementation  (which  may  include 
second-order  correction  steps  or  other  features),  we  can  ensure  that,  near  a  solution,  all  the 
steps  generated  by  the  primal-dual  method  are  acceptable  to  the  merit  function  or  filter. 

We denote the primal-dual iterates by

$$v = (x, s, y, z) \tag{19.55}$$

and define the full primal-dual step (without backtracking) by

$$v^+ = v + p, \tag{19.56}$$

where  p  is  the  solution  of  (19.12).  To  establish  local  convergence  results,  we  assume  that  the 
iterates  converge  to  a  solution  point  satisfying  certain  regularity  assumptions. 

Assumptions  19.1. 

(a) $v^*$ is a solution of the nonlinear program (19.1) for which the first-order KKT conditions
are satisfied.

(b) The Hessian matrices $\nabla^2 f(x)$ and $\nabla^2 c_i(x)$, $i \in \mathcal{E} \cup \mathcal{I}$, are locally Lipschitz continuous
at $v^*$.

(c) The linear independence constraint qualification (LICQ) (Definition 12.4), the strict
complementarity condition (Definition 12.5), and the second-order sufficient conditions
(Theorem 12.6) hold at $v^*$.

We assume that $v$ is an iterate at which the inner stop test $E(v; \mu) \le \epsilon_\mu$ is satisfied,
so that the barrier parameter is decreased from $\mu$ to $\mu^+$. We now study how to control the
parameters in Algorithm 19.2 so that the following three properties hold in a neighborhood
of $v^*$:

1. The iterate $v^+$ satisfies the fraction to the boundary rule (19.9), that is, $\alpha_s^{\max} = \alpha_z^{\max} = 1$.

2. The inner stop test is satisfied at $v^+$, that is, $E(v^+; \mu^+) \le \epsilon_{\mu^+}$.

3. The sequence of iterates (19.56) converges superlinearly to $v^*$.

We can achieve these three goals by letting

$$\epsilon_\mu = \theta\mu \quad \text{and} \quad \epsilon_{\mu^+} = \theta\mu^+, \tag{19.57}$$

for $\theta > 0$, and setting the other parameters as follows:

$$\mu^+ = \mu^{1+\delta}, \quad \delta \in (0, 1); \qquad \tau = 1 - \mu^{\beta}, \quad \beta > \delta. \tag{19.58}$$

There are other practical ways of controlling the parameters of the algorithm. For example,
we may prefer to determine the change in $\mu$ from the reduction achieved in the KKT
conditions of the nonlinear program, as measured by the function $E$. The three results
mentioned above can be established if the convergence tolerance $\epsilon_\mu$ is defined as in (19.57)
and if we replace $\mu$ by $E(v; 0)$ in the right-hand sides of the definitions (19.58) of $\mu^+$ and
$\tau$.

There is a limit to how fast we can decrease $\mu$ and still be able to satisfy the inner stop
test after just one iteration (condition 2). One can show that there is no point in decreasing $\mu$
at a faster than quadratic rate, since the overall convergence cannot be faster than quadratic.
Not surprisingly, if $\tau$ is constant and $\mu^+ = \sigma\mu$, with $\sigma \in (0, 1)$, then the interior-point
algorithm is only linearly convergent.
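
The difference between the two update rules for the barrier parameter can be seen by tabulating the sequences they generate. The sketch below is illustrative only: the starting value 0.1 and the values chosen for the exponents and the constant factor are assumptions, and the code tracks the parameters alone, not the iterates of an interior-point method.

```python
def barrier_sequence(mu0, update, n_iter):
    """Apply the update rule n_iter times, returning all values of mu."""
    mus = [mu0]
    for _ in range(n_iter):
        mus.append(update(mus[-1]))
    return mus


delta = 0.5   # exponent in mu+ = mu**(1 + delta); illustrative value in (0, 1)
beta = 0.75   # exponent in tau = 1 - mu**beta; illustrative value with beta > delta
sigma = 0.2   # factor in mu+ = sigma * mu; illustrative value in (0, 1)

superlinear = barrier_sequence(0.1, lambda mu: mu ** (1.0 + delta), 6)
linear = barrier_sequence(0.1, lambda mu: sigma * mu, 6)

# Ratios mu+/mu: they tend to zero under the power rule but stay equal to
# sigma under the constant-factor rule.
ratios_superlinear = [b / a for a, b in zip(superlinear, superlinear[1:])]
ratios_linear = [b / a for a, b in zip(linear, linear[1:])]

# Fraction-to-the-boundary parameter tau = 1 - mu**beta approaches 1.
taus = [1.0 - mu ** beta for mu in superlinear]

for k, (rs, rl, tau) in enumerate(zip(ratios_superlinear, ratios_linear, taus)):
    print(f"k={k}: power-rule ratio={rs:.3e}  constant-factor ratio={rl:.3e}  tau={tau:.6f}")
```

A ratio $\mu^+/\mu$ that tends to zero is what leaves room for a superlinear rate; a ratio fixed at a constant caps the rate at linear.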

Although  it  is  desirable  to  implement  interior-point  methods  so  that  they  achieve  a 
superlinear  rate  of  convergence,  this  rate  is  typically  observed  only  in  the  last  few  iterations 
in  practice. 


19.9  PERSPECTIVES  AND  SOFTWARE


Software packages that implement nonlinear interior-point methods are widely available.
Line search implementations include LOQO [294], KNITRO/DIRECT [303], IPOPT [301], and
BARNLP [21], and for convex problems, MOSEK [5]. The trust-region algorithm discussed in
Section 19.5 has been implemented in KNITRO/CG [50]. These interior-point packages have
proved to be strong competitors of the leading active-set and augmented Lagrangian packages,
such as MINOS [218], SNOPT [128], LANCELOT [72], FILTERSQP [105], and KNITRO/ACTIVE
[49]. At present, interior-point and active-set methods appear to be the most promising
approaches, while augmented Lagrangian methods seem to be less efficient. The KNITRO
package provides crossover from interior-point to active-set modes [46].

Interior-point  methods  show  their  strength  in  large-scale  applications,  where  they 
often (but not always) outperform active-set methods. In interior-point methods, the linear
system to be solved at every iteration has the same block structure, so effort can be focused
on  exploiting  this  structure.  Both  direct  factorization  techniques  and  projected  CG  methods 
are  available,  allowing  the  user  to  solve  many  types  of  applications  efficiently.  On  the  other 
hand,  interior-point  methods,  unlike  active-set  methods,  consider  all  the  constraints  at  each 
iteration,  even  if  they  are  irrelevant  to  the  solution.  As  a  result,  the  cost  of  the  primal-dual 
iteration  can  be  excessive  in  some  applications. 

One of the main weaknesses of interior-point methods is their sensitivity to the choice
of the initial point, the scaling of the problem, and the update strategy for the barrier parameter
$\mu$. If the iterates approach the boundary of the feasible region prematurely, interior-point
methods may have difficulty escaping it, and convergence can be slow. The availability of
adaptive strategies for updating $\mu$ is, however, beginning to lessen this sensitivity, and more
robust implementations can be expected in the coming years.

Although the description of the line search algorithm in Section 19.4 is fairly complete,
various details of implementation (such as second-order corrections, iterative refinement,
and resetting of parameters) are needed to obtain a robust code. Our description of the
trust-region method of Algorithm 19.4 leaves some important details unspecified, particularly
concerning the procedure for computing approximate solutions of the normal and tangential
subproblems; see [50] for further discussion. The KNITRO/CG implementation of this
trust-region algorithm uses a projected CG iteration in the computation of the step, which allows
the method to work even when only Hessian-vector products are available, not the Hessian
itself.

Filters  and  merit  functions  have  each  been  used  to  globalize  interior-point  methods. 
Although  some  studies  have  shown  that  merit  functions  restrict  the  progress  of  the  iteration 
unduly  [298],  recent  developments  in  penalty  update  procedures  (see  Chapter  18)  have 
altered  the  picture,  and  it  is  currently  unclear  whether  filter  globalization  approaches  are 
preferable. 

NOTES  AND  REFERENCES 

The  development  of  modern  nonlinear  interior-point  methods  was  influenced  by  the 
success  of  interior-point  methods  for  linear  and  quadratic  programming.  The  concept  of 
primal-dual  steps  arises  from  the  homotopy  formulation  given  in  Section  19.1,  which  is 
an  extension  of  the  systems  (14.13)  and  (16.57)  for  linear  and  quadratic  programming. 
Although  the  primal  barrier  methods  of  Section  19.6  predate  primal-dual  methods  by  at 
least  15  years,  they  played  a  limited  role  in  their  development. 

There is a vast literature on nonlinear interior-point methods. We refer the reader
to the surveys by Forsgren, Gill, and Wright [111] and Gould, Orban, and Toint [147] for
a comprehensive list of references. The latter paper also compares and contrasts interior-point
methods with other nonlinear optimization methods. For an analysis of interior-point
methods that use filter globalization see, for example, Ulbrich, Ulbrich, and Vicente [291]
and Wächter and Biegler [300]. The book by Conn, Gould, and Toint [74] gives a thorough
presentation of several interior-point methods.

Primal barrier methods were originally proposed by Frisch [115] and were analyzed
in an authoritative book by Fiacco and McCormick [98]. The term “interior-point method”
and the concept of the primal central path $\mathcal{C}_p$ appear to have originated in this book. Nesterov
and Nemirovskii [226] propose and analyze several families of barrier methods and establish
polynomial-time complexity results for very general classes of problems such as semidefinite
and second-order cone programming. For a discussion of the history of barrier function
methods, see Nash [221].


Exercises

19.1 Consider the nonlinear program

$$\min f(x) \quad \text{subject to} \quad c_{\mathrm{E}}(x) = 0, \quad c_{\mathrm{I}}(x) \ge 0. \tag{19.59}$$

(a) Write down the KKT conditions of (19.1) and (19.59), and establish a one-to-one
correspondence between KKT points of these problems (despite the different numbers
of variables and multipliers).

(b) The multipliers $z$ correspond to the equality constraints (19.1c) and should therefore
be unsigned. Nonetheless, argue that (19.2) with $\mu = 0$ together with (19.3) can be
seen as the KKT conditions of problem (19.1). Moreover, argue that the multipliers $z$
in (19.2) can be seen as the multipliers of the inequalities $c_{\mathrm{I}}$ in (19.59).

(c) Suppose $x$ is feasible for (19.59). Show that LICQ holds at $x$ for (19.59) if and only if
LICQ holds at $(x, s)$ for (19.1), with $s = c_{\mathrm{I}}(x)$.

(d) Repeat part (c) assuming that the MFCQ condition holds (see Definition 12.6) instead
of LICQ.

19.2 This question concerns Algorithm 19.1.

(a) Extend the proof of Theorem 19.1 to the general nonlinear program (19.1).

(b) Show that the theorem still holds if the condition $E(x_k, s_k, y_k, z_k; \mu_k) \le \mu_k$ is replaced by
$E(x_k, s_k, y_k, z_k; \mu_k) \le \epsilon_{\mu_k}$, for any sequence $\epsilon_{\mu_k}$ that converges to 0 as $\mu_k \to 0$.

(c) Suppose that in Algorithm 19.1 the new iterate $(x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1})$ is obtained by
any means. What conditions are required on this iterate so that Theorem 19.1 holds?

19.3 Consider the nonlinear system of equations (11.1). Show that Newton’s method
(11.6) is invariant to scalings of the equations. More precisely, show that the Newton step $p$
does not change if each component of $r$ is multiplied by a nonzero constant.




19.4 Consider the system

$$x_1 + x_2 - 2 = 0,$$
$$x_1 x_2 - 2x_1 + 1 = 0.$$

Find all the solutions to this system. Show that if the first equation is multiplied by $x_2$, the
solutions do not change but the Newton step taken from $(1, -1)$ will not be the same as
that for the original system.

19.5 Let $(x, s, y, z)$ be a primal-dual solution that satisfies the LICQ and strict
complementarity conditions.

(a) Give conditions on $\nabla^2_{xx}\mathcal{L}(x, s, y, z)$ that ensure that the primal-dual matrix in (19.6)
is nonsingular.

(b) Show that some diagonal elements of $\Sigma$ tend to infinity and others tend to zero when
$\mu \to 0$. Can you characterize each case? Consider the cases in which $\Sigma$ is defined by
(19.13) and (19.14).

(c) Argue that the matrix in (19.6) is not ill conditioned under the assumptions of this
problem.

19.6

(a) Introduce the change of variables $\tilde{p}_s = S^{-1} p_s$ in (19.12), and show that the (2, 2)
block of the primal-dual matrix has a cluster of eigenvalues around 0 when $\mu \to 0$.

(b) Analyze the eigenvalue distribution of the (2, 2) block if the change of variables is given
by $\tilde{p}_s = \Sigma^{1/2} p_s$ or $\tilde{p}_s = \sqrt{\mu}\, S^{-1} p_s$.

(c) Let $\gamma > 0$ be the smallest eigenvalue of $\nabla^2_{xx}\mathcal{L}$. Describe a change of variables for which
all the eigenvalues of the (2, 2) block converge to $\gamma$ as $\mu \to 0$.

19.7 Program the simple interior-point method Algorithm 19.1 and apply it to the
problem (18.69). Use the same starting point as in that problem. Try different values for the
parameter $\sigma$.

19.8

(a) Compute the minimum-norm solution of the system of equations defined by (19.35).
(This system defines the Newton component in the dogleg method used to find an
approximate solution to (19.34).) Show that the computation of the Newton component
can use the factorization of the augmented matrix defined in (19.36).

(b) Compute the unconstrained minimizer of the quadratic in (19.34a) along the steepest
descent direction, starting from $v = 0$. (This minimizer defines the Cauchy component
in the dogleg method used to find an approximate solution to (19.34).)

(c) The dogleg step is a combination of the Newton and Cauchy steps from parts (a) and
(b). Show that the dogleg step is in the range space of $A^T$.

19.9

(a) If the normal subproblem (19.34a)-(19.34c) is solved by using the dogleg method,
show that the solution $v$ is in the range space of matrix $A^T$ defined in (19.36).

(b) After the normal step $v$ is obtained, we define the residual vectors $r_{\mathrm{E}}$ and $r_{\mathrm{I}}$ as in
(19.35) and $w = p - v$. Show that (19.33) becomes a quadratic program with circular
trust-region constraint and bound constraint in the variables $w$.

(c) Show that the solution $w$ of the problem derived in part (b) is orthogonal to the normal
step $v$, that is, that $w^T v = 0$.

19.10 Verify that the least-squares multiplier formula (18.21) corresponding to
(19.33a)-(19.33c) is given by (19.37).

19.11

(a) Write the primal-dual system (19.6) for problem (19.53), considering $s_1$, $s_2$ as slacks
and denoting the multipliers of (19.53b), (19.53c) by $z_1$, $z_2$. (You should get a system
of five equations with five unknowns.) Show that the matrix of the system is singular
at any iterate of the form $(x, 0, 0)$.

(b) Show that if the starting point in Example (19.53) lies in the region (19.54), the
interior-point step leads to a point on the tangent line to the parabola, as illustrated in
Figure 19.2. (More specifically, show that the tangent line never lies to the left of the
parabola.)

(c) Let $x^{(0)} = -2$, $s_1^{(0)} = 1$, $s_2^{(0)} = 1$, let $z_1^{(0)} = z_2^{(0)} = 1$, and let $\mu = 0$. Compute the full
Newton step based on the system in part (a). Truncate, if necessary, to satisfy a fraction
to the boundary rule with $\tau = 1$. Verify that the new iterate is still in the region (19.54).

(d) Let us consider the behavior of an SQP method. For the initial point in (c),
show that the linearized constraints of problem (18.56) (don’t forget the constraints
$s_1 \ge 0$, $s_2 \ge 0$) are inconsistent. Therefore, the SQP subproblem (18.11) is inconsistent,
and a relaxation of the constraint of the SQP subproblem must be performed.

19.12 Consider the following problem in a single variable $x$:

$$\min x \quad \text{subject to} \quad x \ge 0, \quad 1 - x \ge 0.$$

(a) Write the primal barrier function $P(x; \mu)$ associated with this problem.

(b) Plot the barrier function for different values of $\mu$.

(c) Characterize the minimizers of the barrier function as a function of $\mu$ and consider the
limit as $\mu$ goes to 0.

19.13 Consider the scalar minimization problem

$$\min_x \frac{1}{1 + x^2} \quad \text{subject to} \quad x \ge 1.$$

Write down $P(x; \mu)$ for this problem, and show that $P(x; \mu)$ is unbounded below for any
positive value of $\mu$. (See Powell [242] and M. Wright [313].)

19.14

The goal of this exercise is to describe an efficient implementation of the limited-memory
BFGS version of the interior-point method using the compact representation
(19.29). First we decompose the primal-dual matrix as

(19.60)

Use the Sherman-Morrison-Woodbury formula to express the inverse (19.60). Then show
that the primal-dual step (19.12) requires the solution of systems of the form $Cv = b$, where
$C$ is the left matrix in (19.60) and $v$ and $b$ are certain vectors.


A.1  ELEMENTS  OF  LINEAR  ALGEBRA

VECTORS  AND  MATRICES 

In this book we work exclusively with vectors and matrices whose components are
real numbers. Vectors are usually denoted by lowercase roman characters, and matrices by
uppercase roman characters. The space of real vectors of length $n$ is denoted by $\mathbb{R}^n$, while
the space of real $m \times n$ matrices is denoted by $\mathbb{R}^{m \times n}$.

Given a vector $x \in \mathbb{R}^n$, we use $x_i$ to denote its $i$th component. We invariably assume
that $x$ is a column vector, that is,

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

The transpose of $x$, denoted by $x^T$, is the row vector

$$x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix},$$

and is often also written with parentheses as $x = (x_1, x_2, \dots, x_n)$. We write $x \ge 0$ to indicate
componentwise nonnegativity, that is, $x_i \ge 0$ for all $i = 1, 2, \dots, n$, while $x > 0$ indicates
that $x_i > 0$ for all $i = 1, 2, \dots, n$.

Given $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$, the standard inner product is $x^T y = \sum_{i=1}^n x_i y_i$.

Given a matrix $A \in \mathbb{R}^{m \times n}$, we specify its components by double subscripts as $A_{ij}$,
for $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, n$. The transpose of $A$, denoted by $A^T$, is the $n \times m$
matrix whose components are $A_{ji}$. The matrix $A$ is said to be square if $m = n$. A square
matrix is symmetric if $A = A^T$.

A square matrix $A$ is positive definite if there is a positive scalar $\alpha$ such that

$$x^T A x \ge \alpha x^T x, \quad \text{for all } x \in \mathbb{R}^n. \tag{A.1}$$

It is positive semidefinite if

$$x^T A x \ge 0, \quad \text{for all } x \in \mathbb{R}^n.$$

We  can  recognize  that  a  symmetric  matrix  is  positive  definite  by  computing  its  eigenvalues 
and  verifying  that  they  are  all  positive,  or  by  performing  a  Cholesky  factorization.  Both 
techniques  are  discussed  further  in  later  sections. 
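
Both tests can be tried with NumPy. The sketch below assumes the input matrix is symmetric (neither routine checks this), and the helper names and the two example matrices are illustrative choices, not part of the text.

```python
import numpy as np


def is_positive_definite_eig(A):
    """Eigenvalue test: a symmetric A is positive definite iff all eigenvalues > 0."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0.0))


def is_positive_definite_chol(A):
    """Cholesky test: the factorization succeeds only for positive definite A."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False


A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])   # eigenvalues 1 and 3
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])    # eigenvalues -1 and 3

print(is_positive_definite_eig(A), is_positive_definite_chol(A))
print(is_positive_definite_eig(B), is_positive_definite_chol(B))
```

In practice the Cholesky attempt is usually the cheaper of the two, since it stops as soon as a nonpositive pivot appears.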

The diagonal of the matrix $A \in \mathbb{R}^{m \times n}$ consists of the elements $A_{ii}$, for $i =
1, 2, \dots, \min(m, n)$. The matrix $A \in \mathbb{R}^{m \times n}$ is lower triangular if $A_{ij} = 0$ whenever $i < j$; that
is, all elements above the diagonal are zero. It is upper triangular if $A_{ij} = 0$ whenever $i > j$;
that is, all elements below the diagonal are zero. $A$ is diagonal if $A_{ij} = 0$ whenever $i \ne j$.

The identity matrix, denoted by $I$, is the square diagonal matrix whose diagonal
elements are all 1.

A square $n \times n$ matrix $A$ is nonsingular if for any vector $b \in \mathbb{R}^n$, there exists $x \in \mathbb{R}^n$
such that $Ax = b$. For nonsingular matrices $A$, there exists a unique $n \times n$ matrix $B$ such
that $AB = BA = I$. We denote $B$ by $A^{-1}$ and call it the inverse of $A$. It is not hard to show
that the inverse of $A^T$ is the transpose of $A^{-1}$.

A square matrix $Q$ is orthogonal if it has the property that $QQ^T = Q^T Q = I$. In
other words, the inverse of an orthogonal matrix is its transpose.


NORMS


For a vector $x \in \mathbb{R}^n$, we define the following norms:

$$\|x\|_1 \stackrel{\text{def}}{=} \sum_{i=1}^n |x_i|, \tag{A.2a}$$

$$\|x\|_2 \stackrel{\text{def}}{=} \left( \sum_{i=1}^n x_i^2 \right)^{1/2} = (x^T x)^{1/2}, \tag{A.2b}$$

$$\|x\|_\infty \stackrel{\text{def}}{=} \max_{i=1,\dots,n} |x_i|. \tag{A.2c}$$

The norm $\|\cdot\|_2$ is often called the Euclidean norm. We sometimes refer to $\|\cdot\|_1$ as the $\ell_1$
norm and to $\|\cdot\|_\infty$ as the $\ell_\infty$ norm. All these norms measure the length of the vector in
some sense, and they are equivalent in the sense that each one is bounded above and below
by a multiple of the other. To be precise, we have for all $x \in \mathbb{R}^n$ that

$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \qquad \|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty, \tag{A.3}$$

and so on. In general, a norm is any mapping $\|\cdot\|$ from $\mathbb{R}^n$ to the nonnegative real numbers
that satisfies the following properties:

$$\|x + z\| \le \|x\| + \|z\|, \quad \text{for all } x, z \in \mathbb{R}^n; \tag{A.4a}$$

$$\|x\| = 0 \;\Rightarrow\; x = 0; \tag{A.4b}$$

$$\|\alpha x\| = |\alpha|\,\|x\|, \quad \text{for all } \alpha \in \mathbb{R} \text{ and } x \in \mathbb{R}^n. \tag{A.4c}$$

For the Euclidean norm, equality holds in (A.4a) if and only if one of the vectors $x$ and $z$
is a nonnegative scalar multiple of the other.

Another interesting property that holds for the Euclidean norm $\|\cdot\| = \|\cdot\|_2$ is the
Cauchy-Schwarz inequality, which states that

$$|x^T z| \le \|x\|\,\|z\|, \tag{A.5}$$

with equality if and only if one of these vectors is a scalar multiple of the other. We
can prove this result as follows:

$$0 \le \|\alpha x + z\|^2 = \alpha^2 \|x\|^2 + 2\alpha x^T z + \|z\|^2.$$

The right-hand side is a convex function of $\alpha$, and it satisfies the required nonnegativity
property only if there exist fewer than 2 distinct real roots, that is,

$$(2 x^T z)^2 \le 4 \|x\|^2 \|z\|^2,$$

proving (A.5). Equality occurs when the quadratic in $\alpha$ has exactly one real root (that is,
$|x^T z| = \|x\|\,\|z\|$) and when $\alpha x + z = 0$ for some $\alpha$, as claimed.
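
The norm equivalences (A.3) and the Cauchy-Schwarz inequality (A.5) are easy to check numerically. The following sketch uses randomly generated vectors; the dimension, the random seed, and the multiple 2.5 are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)
z = rng.standard_normal(n)

norm_1 = np.linalg.norm(x, 1)
norm_2 = np.linalg.norm(x, 2)
norm_inf = np.linalg.norm(x, np.inf)

# (A.3): each norm is bounded above and below by a multiple of the other.
equivalence_holds = (
    norm_inf <= norm_2 <= np.sqrt(n) * norm_inf
    and norm_inf <= norm_1 <= n * norm_inf
)

# (A.5): the gap ||x|| ||z|| - |x^T z| is nonnegative for any pair ...
cs_gap = np.linalg.norm(x) * np.linalg.norm(z) - abs(x @ z)

# ... and vanishes (up to rounding) when one vector is a multiple of the other.
z_parallel = 2.5 * x
cs_gap_parallel = np.linalg.norm(x) * np.linalg.norm(z_parallel) - abs(x @ z_parallel)

print(equivalence_holds, cs_gap, cs_gap_parallel)
```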

Any norm $\|\cdot\|$ has a dual norm $\|\cdot\|_D$ defined by

$$\|x\|_D = \max_{\|y\| = 1} x^T y. \tag{A.6}$$

It is easy to show that the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are duals of each other, and that the
Euclidean norm is its own dual.
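
As a small illustration of (A.6), the dual of the $\ell_1$ norm can be evaluated directly. The sketch relies on the fact that a linear function attains its maximum over the $\ell_1$ unit ball at one of the $2n$ vertices $\pm e_i$; the function name and the test vector are assumptions made for this example.

```python
import numpy as np


def dual_of_l1(x):
    """Evaluate max of x^T y over ||y||_1 = 1 by enumerating the vertices +/- e_i."""
    n = x.size
    vertices = np.vstack([np.eye(n), -np.eye(n)])  # rows are the 2n vertices
    return float(np.max(vertices @ x))


x = np.array([3.0, -7.0, 2.0])

dual_value = dual_of_l1(x)              # equals the infinity norm of x
inf_norm = np.linalg.norm(x, np.inf)

# For the Euclidean norm the maximizer in (A.6) is y = x / ||x||_2, which
# gives x^T y = ||x||_2, so the norm is its own dual.
y_star = x / np.linalg.norm(x)
euclid_dual_value = float(x @ y_star)

print(dual_value, inf_norm, euclid_dual_value)
```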

We can derive definitions for certain matrix norms from these vector norm definitions.
If we let $\|\cdot\|$ be generic notation for the three norms listed in (A.2), we define the
corresponding matrix norm as

$$\|A\| \stackrel{\text{def}}{=} \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{A.7}$$

The matrix norms defined in this way are said to be consistent with the vector norms (A.2).
Explicit formulae for these norms are as follows:

$$\|A\|_1 = \max_{j=1,\dots,n} \sum_{i=1}^m |A_{ij}|, \tag{A.8a}$$

$$\|A\|_2 = \text{largest eigenvalue of } (A^T A)^{1/2}, \tag{A.8b}$$

$$\|A\|_\infty = \max_{i=1,\dots,m} \sum_{j=1}^n |A_{ij}|. \tag{A.8c}$$

The Frobenius norm $\|A\|_F$ of the matrix $A$ is defined by

$$\|A\|_F = \left( \sum_{i=1}^m \sum_{j=1}^n A_{ij}^2 \right)^{1/2}. \tag{A.9}$$

This norm is useful for many purposes, but it is not consistent with any vector norm. Once
again, these various matrix norms are equivalent with each other in a sense similar to (A.3).
For the Euclidean norm $\|\cdot\| = \|\cdot\|_2$, the following property holds:

$$\|AB\| \le \|A\|\,\|B\| \tag{A.10}$$

for all matrices $A$ and $B$ with consistent dimensions.

The condition number of a nonsingular matrix is defined as

$$\kappa(A) = \|A\|\,\|A^{-1}\|, \tag{A.11}$$

where any matrix norm can be used in the definition. Different norms can be distinguished
by the use of a subscript ($\kappa_1(\cdot)$, $\kappa_2(\cdot)$, and $\kappa_\infty(\cdot)$, respectively), with $\kappa$ denoting $\kappa_2$ by default.
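
The explicit formulae for the matrix norms and the definition of the condition number can be compared against NumPy's built-in routines. The matrices below are arbitrary examples chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0, 4.0],
              [0.0, 1.0]])   # m = 3 rows, n = 2 columns

norm_1 = np.abs(A).sum(axis=0).max()                  # largest absolute column sum
norm_2 = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())   # sqrt of largest eigenvalue of A^T A
norm_inf = np.abs(A).sum(axis=1).max()                # largest absolute row sum
norm_fro = np.sqrt((A ** 2).sum())                    # Frobenius norm

# Condition number in the 2-norm of a square nonsingular matrix,
# computed directly from the definition ||M|| ||M^{-1}||.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
kappa_2 = np.linalg.norm(M, 2) * np.linalg.norm(np.linalg.inv(M), 2)

print(norm_1, norm_2, norm_inf, norm_fro, kappa_2)
```

Forming the inverse explicitly is done here only to mirror the definition; `np.linalg.cond` obtains the same 2-norm value from the singular values.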

Norms  also  have  a  meaning  for  scalar,  vector,  and  matrix-valued  functions  that  are 
defined  on  a  particular  domain.  In  these  cases,  we  can  define  Hilbert  spaces  of  functions  for 
which  the  inner  product  and  norm  are  defined  in  terms  of  an  integral  over  the  domain.  We 
omit details, since all the development of this book takes place in the space $\mathbb{R}^n$, though many
of the algorithms can be extended to more general Hilbert spaces. However, we mention
for purposes of the analysis of Newton-like methods that the following inequality holds for
functions of the type that we consider in this book:

$$\left\| \int_a^b F(t)\,dt \right\| \le \int_a^b \|F(t)\|\,dt, \tag{A.12}$$

where $F$ is a continuous scalar-, vector-, or matrix-valued function on the interval $[a, b]$.


SUBSPACES 

Given the Euclidean space $\mathbb{R}^n$, the subset $S \subset \mathbb{R}^n$ is a subspace of $\mathbb{R}^n$ if the following
property holds: If $x$ and $y$ are any two elements of $S$, then

$$\alpha x + \beta y \in S, \quad \text{for all } \alpha, \beta \in \mathbb{R}.$$

For instance, $S$ is a subspace of $\mathbb{R}^2$ if it consists of (i) the whole space $\mathbb{R}^2$; (ii) any line passing
through the origin; (iii) the origin alone; or (iv) the empty set.

Given any set of vectors $a_i \in \mathbb{R}^n$, $i = 1, 2, \dots, m$, the set

$$S = \{ w \in \mathbb{R}^n \mid a_i^T w = 0, \; i = 1, 2, \dots, m \} \tag{A.13}$$

is a subspace. However, the set

$$\{ w \in \mathbb{R}^n \mid a_i^T w \ge 0, \; i = 1, 2, \dots, m \} \tag{A.14}$$

is not in general a subspace. For example, if we have $n = 2$, $m = 1$, and $a_1 = (1, 0)^T$, this set
would consist of all vectors $(w_1, w_2)^T$ with $w_1 \ge 0$, but then given two vectors $x = (1, 0)^T$
and $y = (2, 3)^T$ in this set, it is easy to choose multiples $\alpha$ and $\beta$ such that $\alpha x + \beta y$ has a
negative first component, and so lies outside the set.

Sets of the forms (A.13) and (A.14) arise in the discussion of second-order optimality
conditions for constrained optimization.

A set of vectors $\{s_1, s_2, \dots, s_m\}$ in $\mathbb{R}^n$ is called a linearly independent set if there are no
real numbers $\alpha_1, \alpha_2, \dots, \alpha_m$ such that

\[
\alpha_1 s_1 + \alpha_2 s_2 + \cdots + \alpha_m s_m = 0,
\]

unless we make the trivial choice $\alpha_1 = \alpha_2 = \cdots = \alpha_m = 0$. Another way to define linear
independence is to say that none of the vectors $s_1, s_2, \dots, s_m$ can be written as a linear
combination of the other vectors in this set. If in fact we have $s_i \in \mathcal{S}$ for all $i = 1, 2, \dots, m$,
we say that $\{s_1, s_2, \dots, s_m\}$ is a spanning set for $\mathcal{S}$ if any vector $s \in \mathcal{S}$ can be written as

\[
s = \alpha_1 s_1 + \alpha_2 s_2 + \cdots + \alpha_m s_m,
\]

for some particular choice of the coefficients $\alpha_1, \alpha_2, \dots, \alpha_m$.

If the vectors $s_1, s_2, \dots, s_m$ are both linearly independent and a spanning set for $\mathcal{S}$,
we call them a basis of $\mathcal{S}$. In this case, $m$ (the number of elements in the basis) is referred to
as the dimension of $\mathcal{S}$, and denoted by $\dim(\mathcal{S})$. Note that there are many ways to choose a
basis of $\mathcal{S}$ in general, but that all bases contain the same number of vectors.
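
The definition of linear independence can be tested mechanically: a set of $m$ vectors is linearly independent exactly when the matrix having them as rows has rank $m$. The sketch below is only an illustration under stated assumptions. The function names `rank` and `is_linearly_independent` are hypothetical, and exact rational arithmetic is used so that the zero tests are reliable.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for c in range(ncols):
        # Look for a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_linearly_independent(vectors):
    """True when no nontrivial linear combination of the vectors is zero."""
    return rank(vectors) == len(vectors)


independent = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
dependent = [[1, 2, 3], [2, 4, 6], [0, 1, 0]]   # second vector is twice the first

print(is_linearly_independent(independent))
print(is_linearly_independent(dependent))
print(rank(dependent))   # dimension of the subspace spanned by these vectors
```

The rank returned for `dependent` is the dimension of the subspace it spans; any two of its vectors that are not multiples of each other form a basis of that subspace.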

If  A  is  any  real  matrix,  the  null  space  is  the  subspace 

\[
\mathrm{Null}(A) = \{ w \mid Aw = 0 \},
\]

while the range space is

\[
\mathrm{Range}(A) = \{ w \mid w = Av \text{ for some vector } v \}.
\]

The fundamental theorem of linear algebra states that

\[
\mathrm{Null}(A) \oplus \mathrm{Range}(A^T) = \mathbb{R}^n,
\]

where $n$ is the number of columns in $A$. (Here, “$\oplus$” denotes the direct sum of two sets:
$A \oplus B = \{ x + y \mid x \in A, \; y \in B \}$.)

When $A$ is square ($n \times n$) and nonsingular, we have $\mathrm{Null}\, A = \mathrm{Null}\, A^T = \{0\}$ and
$\mathrm{Range}\, A = \mathrm{Range}\, A^T = \mathbb{R}^n$. In this case, the columns of $A$ form a basis of $\mathbb{R}^n$, as do the
columns of $A^T$.

EIGENVALUES,  EIGENVECTORS,  AND  THE  SINGULAR-VALUE 
DECOMPOSITION 

A scalar value $\lambda$ is an eigenvalue of the $n \times n$ matrix $A$ if there is a nonzero vector $q$
such that

\[
Aq = \lambda q.
\]

The vector $q$ is called an eigenvector of $A$. The matrix $A$ is nonsingular if none of its
eigenvalues are zero. The eigenvalues of symmetric matrices are all real numbers, while
nonsymmetric matrices may have complex eigenvalues. If the matrix is positive definite as
well as symmetric, its eigenvalues are all positive real numbers.

All matrices $A$ (not necessarily square) can be decomposed as a product of three
matrices with special properties. When $A \in \mathbb{R}^{m \times n}$ with $m > n$ (that is, $A$ has more rows
than columns), this singular-value decomposition (SVD) has the form

\[
A = U \begin{bmatrix} S \\ 0 \end{bmatrix} V^T, \tag{A.15}
\]

where $U$ and $V$ are orthogonal matrices of dimension $m \times m$ and $n \times n$, respectively, and
$S$ is an $n \times n$ diagonal matrix with diagonal elements $\sigma_i$, $i = 1, 2, \dots, n$, that satisfy

\[
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0.
\]

These diagonal values are called the singular values of $A$. We can define the condition
number (A.11) of the $m \times n$ (possibly nonsquare) matrix $A$ to be $\sigma_1 / \sigma_n$. (This definition is
identical to $\kappa_2(A)$ when $A$ happens to be square and nonsingular.)

When $m \le n$ (the number of columns is at least equal to the number of rows), the
SVD has the form

\[
A = U \begin{bmatrix} S & 0 \end{bmatrix} V^T,
\]

where again $U$ and $V$ are orthogonal of dimension $m \times m$ and $n \times n$, respectively, while $S$
is $m \times m$ diagonal with nonnegative diagonal elements $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$.

When $A$ is symmetric, its $n$ real eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ and their associated
orthonormal eigenvectors $q_1, q_2, \dots, q_n$ can be used to write a spectral decomposition of $A$ as follows:

\[
A = \sum_{i=1}^{n} \lambda_i q_i q_i^T.
\]

This decomposition can be restated in matrix form by defining

\[
\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n), \qquad Q = [\, q_1 \mid q_2 \mid \cdots \mid q_n \,],
\]

and writing

\[
A = Q \Lambda Q^T. \tag{A.16}
\]

In fact, when $A$ is positive definite as well as symmetric, this decomposition is identical to
the singular-value decomposition (A.15), where we define $U = V = Q$ and $S = \Lambda$. Note
that the singular values $\sigma_i$ and the eigenvalues $\lambda_i$ coincide in this case.

In the case of the Euclidean norm (A.8b), we have for symmetric positive definite
matrices $A$ that the singular values and eigenvalues of $A$ coincide, and that

\[
\|A\| = \sigma_1(A) = \text{largest eigenvalue of } A,
\]
\[
\|A^{-1}\| = \sigma_n(A)^{-1} = \text{inverse of smallest eigenvalue of } A.
\]

Hence, we have for all $x \in \mathbb{R}^n$ that

\[
\sigma_n(A) \|x\|^2 = \|x\|^2 / \|A^{-1}\| \le x^T A x \le \|A\| \|x\|^2 = \sigma_1(A) \|x\|^2.
\]

For an orthogonal matrix $Q$, we have for the Euclidean norm that

\[
\|Qx\| = \|x\|,
\]

and that all the singular values of this matrix are equal to 1.


DETERMINANT  AND  TRACE 

The trace of an $n \times n$ matrix $A$ is defined by

\[
\mathrm{trace}(A) = \sum_{i=1}^{n} A_{ii}. \tag{A.17}
\]

If the eigenvalues of $A$ are denoted by $\lambda_1, \lambda_2, \dots, \lambda_n$, it can be shown that

\[
\mathrm{trace}(A) = \sum_{i=1}^{n} \lambda_i, \tag{A.18}
\]

that is, the trace of the matrix is the sum of its eigenvalues.

The determinant of an $n \times n$ matrix $A$, denoted by $\det A$, is the product of its
eigenvalues; that is,

\[
\det A = \prod_{i=1}^{n} \lambda_i. \tag{A.19}
\]

The determinant has several appealing (and revealing) properties. For instance,

$\det A = 0$ if and only if $A$ is singular;

$\det AB = (\det A)(\det B)$;

$\det A^{-1} = 1 / \det A$.

Recall that any orthogonal matrix $Q$ has the property that $QQ^T = Q^T Q = I$, so that
$Q^{-1} = Q^T$. It follows from the properties of the determinant that $\det Q = \det Q^T = \pm 1$.

The properties above are used in the analysis of Chapter 6.
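
The identities (A.18) and (A.19) and the determinant properties can be checked numerically on a small example. The sketch below is an illustration only: it uses a symmetric $2 \times 2$ matrix so that the eigenvalues have a simple closed form, and the helper names (`eig_sym_2x2`, `det2`, `matmul2`, `inverse2`) are hypothetical, not part of any library.

```python
import math


def eig_sym_2x2(m):
    """Eigenvalues (largest first) of a symmetric 2x2 matrix m."""
    mean = (m[0][0] + m[1][1]) / 2.0
    radius = math.hypot((m[0][0] - m[1][1]) / 2.0, m[0][1])
    return mean + radius, mean - radius


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def matmul2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, so the closed form applies
B = [[1.0, 4.0], [0.0, 2.0]]   # any nonsingular matrix will do here
lam1, lam2 = eig_sym_2x2(A)

checks = {
    "trace = sum of eigenvalues (A.18)":
        math.isclose(A[0][0] + A[1][1], lam1 + lam2),
    "det = product of eigenvalues (A.19)":
        math.isclose(det2(A), lam1 * lam2),
    "det(AB) = det(A) det(B)":
        math.isclose(det2(matmul2(A, B)), det2(A) * det2(B)),
    "det(A^-1) = 1 / det(A)":
        math.isclose(det2(inverse2(A)), 1.0 / det2(A)),
}
for name, ok in checks.items():
    print(name, ok)
```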

MATRIX  FACTORIZATIONS:  CHOLESKY,  LU,  QR 

Matrix factorizations are important both in the design of algorithms and in their
analysis. One such factorization is the singular-value decomposition defined above in (A.15).
Here we define the other important factorizations.

All the factorization algorithms described below make use of permutation matrices.
Suppose that we wish to exchange the first and fourth rows of a matrix $A$. We can perform
this operation by premultiplying $A$ by a permutation matrix $P$, which is constructed by
interchanging the first and fourth rows of an identity matrix that contains the same number
of rows as $A$. Suppose, for example, that $A$ is a $5 \times 5$ matrix. The appropriate choice of $P$
would be

\[
P = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}.
\]

A similar technique is used to find a permutation matrix $P$ that exchanges columns of a
matrix.
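
The effect of the permutation matrix above is easy to confirm on a small example. The following sketch is an illustration with hypothetical helper names. It builds $P$ by swapping rows of the identity, then premultiplies to exchange rows and postmultiplies to exchange columns.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def row_exchange_matrix(n, r, s):
    """Permutation matrix obtained by swapping rows r and s (0-based) of I."""
    P = identity(n)
    P[r], P[s] = P[s], P[r]
    return P


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


# A[i][j] holds the two-digit label "row, column" so that moves are visible.
A = [[10 * i + j for j in range(1, 6)] for i in range(1, 6)]
P = row_exchange_matrix(5, 0, 3)   # first and fourth rows, as in the text

PA = matmul(P, A)   # first and fourth rows of A exchanged
AP = matmul(A, P)   # first and fourth columns of A exchanged

for row in PA:
    print(row)
```

Since $P$ is orthogonal, `matmul(P, transpose(P))` reproduces the identity, which is why applying the same exchange twice restores the original matrix.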

The LU factorization of a matrix $A \in \mathbb{R}^{n \times n}$ is defined as

\[
PA = LU, \tag{A.20}
\]

where

$P$ is an $n \times n$ permutation matrix (that is, it is obtained by rearranging the rows of
the $n \times n$ identity matrix),

$L$ is unit lower triangular (that is, lower triangular with diagonal elements equal to 1),
and

$U$ is upper triangular.

This factorization can be used to solve a linear system of the form $Ax = b$ efficiently by the
following three-step process:

form $\tilde{b} = Pb$ by permuting the elements of $b$;

solve $Lz = \tilde{b}$ by performing triangular forward-substitution, to obtain the vector $z$;

solve $Ux = z$ by performing triangular back-substitution, to obtain the solution
vector $x$.

The factorization (A.20) can be found by using Gaussian elimination with row partial
pivoting, an algorithm that requires approximately $2n^3/3$ floating-point operations when $A$
is dense. Standard software that implements this algorithm (notably, LAPACK [7]) is readily
available. The method can be stated as follows.

Algorithm A.1 (Gaussian Elimination with Row Partial Pivoting).

  Given $A \in \mathbb{R}^{n \times n}$;
  Set $P \leftarrow I$, $L \leftarrow 0$;
  for $i = 1, 2, \dots, n$
    find the index $j \in \{i, i+1, \dots, n\}$ such that $|A_{ji}| = \max_{k = i, i+1, \dots, n} |A_{ki}|$;
    if $A_{ji} = 0$
      stop; (* matrix A is singular *)
    if $i \ne j$
      swap rows $i$ and $j$ of matrices $A$, $L$, and $P$;
    (* elimination step *)
    $L_{ii} \leftarrow 1$;
    for $k = i+1, i+2, \dots, n$
      $L_{ki} \leftarrow A_{ki} / A_{ii}$;
      for $l = i+1, i+2, \dots, n$
        $A_{kl} \leftarrow A_{kl} - L_{ki} A_{il}$;
      end (for)
    end (for)
  end (for)
  $U \leftarrow$ upper triangular part of $A$.
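
Algorithm A.1 and the three-step solution process can be sketched in a few lines of plain Python. This is a minimal illustration for small dense matrices, not a substitute for LAPACK. The function names are hypothetical, and the permutation $P$ is stored as an index list `perm` (an assumption of this sketch) rather than as a matrix.

```python
def lu_partial_pivoting(A):
    """Return (perm, L, U) such that row perm[i] of A equals row i of L*U.

    perm encodes P in PA = LU. Raises ValueError when every candidate
    pivot in some column is zero, which means A is singular.
    """
    n = len(A)
    U = [list(map(float, row)) for row in A]   # working copy, becomes U
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for i in range(n):
        # Pivot search: largest-magnitude entry in column i, rows i..n-1.
        j = max(range(i, n), key=lambda k: abs(U[k][i]))
        if U[j][i] == 0.0:
            raise ValueError("matrix is singular")
        if j != i:
            U[i], U[j] = U[j], U[i]
            L[i], L[j] = L[j], L[i]
            perm[i], perm[j] = perm[j], perm[i]
        L[i][i] = 1.0
        for k in range(i + 1, n):            # elimination step
            L[k][i] = U[k][i] / U[i][i]
            U[k][i] = 0.0
            for c in range(i + 1, n):
                U[k][c] -= L[k][i] * U[i][c]
    return perm, L, U


def lu_solve(perm, L, U, b):
    """Solve A x = b given the factors of PA = LU."""
    n = len(b)
    bt = [float(b[p]) for p in perm]             # step 1: permute b
    z = [0.0] * n
    for i in range(n):                           # step 2: forward-substitution
        z[i] = bt[i] - sum(L[i][k] * z[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):                 # step 3: back-substitution
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (z[i] - s) / U[i][i]
    return x


A = [[0.0, 1.0], [1.0, 2.0]]   # zero in the (1,1) position forces a row swap
b = [1.0, 3.0]
perm, L, U = lu_partial_pivoting(A)
x = lu_solve(perm, L, U, b)
print(perm, x)
```

The exact zero test on the pivot mirrors the pseudocode; in floating-point practice a nearly singular matrix will usually pass this test and show up instead as very small pivots.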


Variants  of  the  basic  algorithm  allow  for  rearrangement  of  the  columns  as  well  as 
the  rows  during  the  factorization,  but  these  do  not  add  to  the  practical  stability  properties 
of  the  algorithm.  Column  pivoting  may,  however,  improve  the  performance  of  Gaussian 
elimination  when  the  matrix  A  is  sparse,  by  ensuring  that  the  factors  L  and  U  are  also 
reasonably  sparse. 

Gaussian elimination can be applied also to the case in which $A$ is not square. When
$A$ is $m \times n$, with $m > n$, the standard row pivoting algorithm produces a factorization of the
form (A.20), where $L \in \mathbb{R}^{m \times n}$ is unit lower triangular and $U \in \mathbb{R}^{n \times n}$ is upper triangular.
When $m < n$, we can find an LU factorization of $A^T$ rather than $A$, that is, we obtain

\[
P A^T = \begin{bmatrix} L_1 \\ L_2 \end{bmatrix} U, \tag{A.21}
\]

where $L_1$ is $m \times m$ (square) unit lower triangular, $U$ is $m \times m$ upper triangular, and $L_2$ is a
general $(n - m) \times m$ matrix. If $A$ has full row rank, we can use this factorization to calculate
its null space explicitly as the space spanned by the columns of the matrix

\[
M = P^T \begin{bmatrix} L_1^{-T} L_2^T \\ -I \end{bmatrix}. \tag{A.22}
\]

It is easy to check that $M$ has dimensions $n \times (n - m)$ and that $AM = 0$. Indeed, (A.21) gives
$A = U^T \begin{bmatrix} L_1^T & L_2^T \end{bmatrix} P$, so that $AM = U^T (L_2^T - L_2^T) = 0$.

When $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, it is possible to compute a similar
but more specialized factorization at about half the cost (about $n^3/3$ operations). This
factorization, known as the Cholesky factorization, produces a lower triangular matrix $L$ such that

\[
A = L L^T. \tag{A.23}
\]

(If we require $L$ to have positive diagonal elements, it is uniquely defined by this formula.)
The algorithm can be specified as follows.

Algorithm A.2 (Cholesky Factorization).

  Given $A \in \mathbb{R}^{n \times n}$ symmetric positive definite;
  for $i = 1, 2, \dots, n$
    $L_{ii} \leftarrow \sqrt{A_{ii}}$;
    for $j = i+1, i+2, \dots, n$
      $L_{ji} \leftarrow A_{ji} / L_{ii}$;
      for $k = i+1, i+2, \dots, j$
        $A_{jk} \leftarrow A_{jk} - L_{ji} L_{ki}$;
      end (for)
    end (for)
  end (for)

Note  that  this  algorithm  references  only  the  lower  triangular  elements  of  A;  in  fact, 
it  is  only  necessary  to  store  these  elements  in  any  case,  since  by  symmetry  they  are  simply 
duplicated  in  the  upper  triangular  positions. 
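
A direct transcription of Algorithm A.2 into Python might look as follows. This is a sketch for small dense matrices with hypothetical function names. As an added safeguard that is not part of the pseudocode, it raises an error when a pivot is not positive, so it can also serve as a test of positive definiteness.

```python
import math


def cholesky(A):
    """Lower triangular L with A = L L^T, following Algorithm A.2.

    Only the lower triangle of A is read. Raises ValueError when a pivot
    is not positive, which signals that A is not positive definite.
    """
    n = len(A)
    # Working copy of the lower triangle; the upper triangle is never used.
    W = [[float(A[i][j]) if j <= i else 0.0 for j in range(n)]
         for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if W[i][i] <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[i][i] = math.sqrt(W[i][i])
        for j in range(i + 1, n):
            L[j][i] = W[j][i] / L[i][i]
            for k in range(i + 1, j + 1):
                W[j][k] -= L[j][i] * L[k][i]
    return L


def is_positive_definite(A):
    """True when the factorization of the symmetric matrix A completes."""
    try:
        cholesky(A)
        return True
    except ValueError:
        return False


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))   # an indefinite matrix
```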

Unlike  the  case  of  Gaussian  elimination,  the  Cholesky  algorithm  can  produce  a  valid 
factorization  of  a  symmetric  positive  definite  matrix  without  swapping  any  rows  or  columns. 
However,  symmetric  permutation  (that  is,  reordering  the  rows  and  columns  in  the  same 
way) can be used to improve the sparsity of the factor $L$. In this case, the algorithm produces
a factorization of the form

\[
P^T A P = L L^T
\]

for some permutation matrix $P$.

The Cholesky factorization can be used to compute solutions of the system $Ax = b$
by performing triangular forward- and back-substitutions with $L$ and $L^T$, respectively, as
in the case of $L$ and $U$ factors produced by Gaussian elimination.

The Cholesky factorization can also be used to verify positive definiteness of a symmetric
matrix $A$. If Algorithm A.2 runs to completion with all $L_{ii}$ values well defined and
positive, then $A$ is positive definite.

Another useful factorization of rectangular matrices $A \in \mathbb{R}^{m \times n}$ has the form

\[
AP = QR, \tag{A.24}
\]

where

$P$ is an $n \times n$ permutation matrix,

$Q$ is $m \times m$ orthogonal, and

$R$ is $m \times n$ upper triangular.

In the case of a square matrix $m = n$, this factorization can be used to compute solutions of
linear systems of the form $Ax = b$ via the following procedure:

set $\tilde{b} = Q^T b$;

solve $Rz = \tilde{b}$ for $z$ by performing back-substitution;

set $x = Pz$ by rearranging the elements of $z$.

For a dense matrix $A$, the cost of computing the QR factorization is about $4m^2 n/3$ operations.
In  the  case  of  a  square  matrix,  the  operation  count  is  about  twice  as  high  as  for  an  LU 
factorization  via  Gaussian  elimination.  Moreover,  it  is  more  difficult  to  maintain  sparsity  in 
a  QR  factorization  than  in  an  LU  factorization. 

Algorithms to perform QR factorization are almost as simple as algorithms for Gaussian
elimination and for Cholesky factorization. The most widely used algorithms work
by applying a sequence of special orthogonal matrices to $A$, known either as Householder
transformations or Givens rotations, depending on the algorithm. We omit the details, and
refer instead to Golub and Van Loan [136, Chapter 5] for a complete description.
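
To give a flavor of how such orthogonal transformations are used, the sketch below solves a square system by Givens rotations. It is an illustration under simplifying assumptions, not the method of any particular library: no column permutation is used (that is, $P = I$), the matrix $Q$ is never formed explicitly (the rotations are applied to $b$ as they are generated, which yields $Q^T b$), and the name `qr_solve_givens` is hypothetical.

```python
import math


def qr_solve_givens(A, b):
    """Solve the square system A x = b via Givens rotations.

    A is reduced to upper triangular R one subdiagonal entry at a time.
    The same rotations are applied to b, giving Q^T b, and the triangular
    system is then solved by back-substitution. Returns (x, R).
    """
    n = len(A)
    R = [list(map(float, row)) for row in A]
    qtb = list(map(float, b))
    for j in range(n):
        for i in range(n - 1, j, -1):
            # Rotate rows i-1 and i so that R[i][j] becomes zero.
            a, c2 = R[i - 1][j], R[i][j]
            r = math.hypot(a, c2)
            if r == 0.0:
                continue
            c, s = a / r, c2 / r
            for k in range(j, n):
                top, bot = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * top + s * bot
                R[i][k] = -s * top + c * bot
            R[i][j] = 0.0   # exact zero by construction of the rotation
            top, bot = qtb[i - 1], qtb[i]
            qtb[i - 1] = c * top + s * bot
            qtb[i] = -s * top + c * bot
    x = [0.0] * n
    for i in reversed(range(n)):
        if R[i][i] == 0.0:
            raise ValueError("matrix is singular")
        s = sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (qtb[i] - s) / R[i][i]
    return x, R


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
b = [5.0, -2.0, 9.0]
x, R = qr_solve_givens(A, b)
print(x)
```

Each rotation is orthogonal, so the Euclidean norm of every column of the working matrix is preserved throughout the reduction.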

In the case of a rectangular matrix $A$ with $m < n$, we can use the QR factorization of
$A^T$ to find a matrix whose columns span the null space of $A$. To be specific, we write

\[
A^T P = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} R,
\]

where $Q_1$ consists of the first $m$ columns of $Q$, and $Q_2$ contains the last $n - m$ columns. It is
easy to show that columns of the matrix $Q_2$ span the null space of $A$. This procedure yields
a more satisfactory basis matrix for the null space than the Gaussian elimination procedure
(A.22), because the columns of $Q_2$ are orthogonal to each other and have unit length. It
may be more expensive to compute, however, particularly in the case in which $A$ is sparse.

When $A$ has full column rank, we can make an identification between the $R$ factor in
(A.24) and the Cholesky factorization. By multiplying the formula (A.24) by its transpose,
we obtain

\[
P^T A^T A P = R^T Q^T Q R = R^T R,
\]

and by comparison with (A.23), we see that $R^T$ is simply the Cholesky factor of the symmetric
positive definite matrix $P^T A^T A P$. Recalling that $L$ is uniquely defined when we restrict its
diagonal elements to be positive, this observation implies that $R$ is also uniquely defined
for a given choice of permutation matrix $P$, provided that we enforce positiveness of the
diagonals of $R$. Note, too, that since we can rearrange (A.24) to read $A P R^{-1} = Q$, we can
conclude that $Q$ is also uniquely defined under these conditions.

Note that by definition of the Euclidean norm and the property (A.10), and the fact
that the Euclidean norms of the matrices $P$ and $Q$ in (A.24) are both 1, we have that

\[
\|A\| = \|Q R P^T\| \le \|Q\| \, \|R\| \, \|P^T\| = \|R\|,
\]

while

\[
\|R\| = \|Q^T A P\| \le \|Q^T\| \, \|A\| \, \|P\| = \|A\|.
\]

We conclude from these two inequalities that $\|A\| = \|R\|$. When $A$ is square, we have by a
similar argument that $\|A^{-1}\| = \|R^{-1}\|$. Hence the Euclidean-norm condition number of
$A$ can be estimated by substituting $R$ for $A$ in the expression (A.11). This observation is
significant because various techniques are available for estimating the condition number of
triangular matrices $R$; see Golub and Van Loan [136, pp. 128-130] for a discussion.


SYMMETRIC  INDEFINITE  FACTORIZATION 

When matrix $A$ is symmetric but indefinite, Algorithm A.2 will break down by trying
to take the square root of a negative number. We can however produce a factorization,
similar to the Cholesky factorization, of the form

\[
P A P^T = L B L^T, \tag{A.25}
\]

where $L$ is unit lower triangular, $B$ is a block diagonal matrix with blocks of dimension 1 or
2, and $P$ is a permutation matrix. The first step of this symmetric indefinite factorization
proceeds as follows. We identify a submatrix $E$ of $A$ that is suitable to be used as a pivot
block. The precise criteria that can be used to choose $E$ are described below, but we note here
that $E$ is either a single diagonal element of $A$ (a $1 \times 1$ pivot block), or else the $2 \times 2$ block
consisting of two diagonal elements of $A$ (say, $a_{ii}$ and $a_{jj}$) along with the corresponding
off-diagonal elements (that is, $a_{ij}$ and $a_{ji}$). In either case, $E$ must be nonsingular. We then
find a permutation matrix $P_1$ that makes $E$ a leading principal submatrix of $A$, that is,

\[
P_1 A P_1^T = \begin{bmatrix} E & C^T \\ C & H \end{bmatrix}, \tag{A.26}
\]

and then perform a block factorization on this rearranged matrix, using $E$ as the pivot
block, to obtain

\[
P_1 A P_1^T =
\begin{bmatrix} I & 0 \\ C E^{-1} & I \end{bmatrix}
\begin{bmatrix} E & 0 \\ 0 & H - C E^{-1} C^T \end{bmatrix}
\begin{bmatrix} I & E^{-1} C^T \\ 0 & I \end{bmatrix}.
\]

The next step of the factorization consists in applying exactly the same process to
$H - C E^{-1} C^T$, known as the remaining matrix or the Schur complement, which has dimension
either $(n - 1) \times (n - 1)$ or $(n - 2) \times (n - 2)$. We now apply the same procedure recursively,
terminating with the factorization (A.25). Here $P$ is defined as a product of the permutation
matrices from each step of the factorization, and $B$ contains the pivot blocks $E$ on its
diagonal.

The symmetric indefinite factorization requires approximately $n^3/3$ floating-point
operations (the same as the cost of the Cholesky factorization of a positive definite matrix),
but to this count we must add the cost of identifying suitable pivot blocks $E$ and of
performing the permutations, which can be considerable. There are various strategies for
determining  the  pivot  blocks,  which  have  an  important  effect  on  both  the  cost  of  the 
factorization  and  its  numerical  properties.  Ideally,  our  strategy  for  choosing  E  at  each  step 
of  the  factorization  procedure  should  be  inexpensive,  should  lead  to  at  most  modest  growth 
in  the  elements  of  the  remaining  matrix  at  each  step  of  the  factorization,  and  should  avoid 
excessive  fill-in  (that  is,  L  should  not  be  too  much  more  dense  than  A). 

A well-known strategy, due to Bunch and Parlett [43], searches the whole remaining
matrix and identifies the largest-magnitude diagonal and largest-magnitude off-diagonal
elements, denoting their respective magnitudes by $\xi_{\mathrm{dia}}$ and $\xi_{\mathrm{off}}$. If the diagonal element
whose magnitude is $\xi_{\mathrm{dia}}$ is selected to be a $1 \times 1$ pivot block, the element growth in the
remaining matrix is bounded by the ratio $\xi_{\mathrm{dia}} / \xi_{\mathrm{off}}$. If this growth rate is acceptable, we
choose this diagonal element to be the pivot block. Otherwise, we select the off-diagonal
element whose magnitude is $\xi_{\mathrm{off}}$ ($a_{ij}$, say), and choose $E$ to be the $2 \times 2$ submatrix that
includes this element, that is,

\[
E = \begin{bmatrix} a_{ii} & a_{ij} \\ a_{ij} & a_{jj} \end{bmatrix}.
\]

This pivoting strategy of Bunch and Parlett is numerically stable and guarantees to yield a
matrix $L$ whose maximum element is bounded by 2.781. Its drawback is that the evaluation
of $\xi_{\mathrm{dia}}$ and $\xi_{\mathrm{off}}$ at each iteration requires many comparisons between floating-point numbers
to be performed: $O(n^3)$ in total during the overall factorization. Since each comparison costs
roughly the same as an arithmetic operation, this overhead is not insignificant.

The more economical pivoting strategy of Bunch and Kaufman [42] searches at most
two columns of the working matrix at each stage and requires just $O(n^2)$ comparisons in
total. Its rationale and details are somewhat tricky, and we refer the interested reader to the
original paper [42] or to Golub and Van Loan [136, Section 4.4] for details. Unfortunately,
this algorithm can give rise to arbitrarily large elements in the lower triangular factor $L$,
making it unsuitable for use with a modified Cholesky strategy.

The bounded Bunch-Kaufman strategy is essentially a compromise between the
Bunch-Parlett and Bunch-Kaufman strategies. It monitors the sizes of elements in $L$,
accepting the (inexpensive) Bunch-Kaufman choice of pivot block when it yields only modest
element growth, but searching further for an acceptable pivot when this growth is excessive.
Its total cost is usually similar to that of Bunch-Kaufman, but in the worst case it can
approach the cost of Bunch-Parlett.

So  far,  we  have  ignored  the  effect  of  the  choice  of  pivot  block  E  on  the  sparsity  of  the 
final  L  factor.  This  consideration  is  important  when  the  matrix  to  be  factored  is  large  and 
sparse,  since  it  greatly  affects  both  the  CPU  time  and  the  amount  of  storage  required  by  the 
algorithm.  Algorithms  that  modify  the  strategies  above  to  take  account  of  sparsity  have  been 
proposed  by  Duff  et  al.  [97],  Duff  and  Reid  [95],  and  Fourer  and  Mehrotra  [113]. 


SHERMAN-MORRISON-WOODBURY  FORMULA 

If the square nonsingular matrix $A$ undergoes a rank-one update to become

\[
\bar{A} = A + a b^T,
\]

where $a, b \in \mathbb{R}^n$, then if $\bar{A}$ is nonsingular, we have

\[
\bar{A}^{-1} = A^{-1} - \frac{A^{-1} a b^T A^{-1}}{1 + b^T A^{-1} a}. \tag{A.27}
\]

It is easy to verify this formula: Simply multiply the definitions of $\bar{A}$ and $\bar{A}^{-1}$ together and
check that they produce the identity.

This formula can be extended to higher-rank updates. Let $U$ and $V$ be matrices in
$\mathbb{R}^{n \times p}$ for some $p$ between 1 and $n$. If we define

\[
\hat{A} = A + U V^T,
\]

then $\hat{A}$ is nonsingular if and only if $(I + V^T A^{-1} U)$ is nonsingular, and in this case we have

\[
\hat{A}^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}. \tag{A.28}
\]

We can use this formula to solve linear systems of the form $\hat{A} x = d$. Since

\[
x = \hat{A}^{-1} d = A^{-1} d - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1} d,
\]

we see that $x$ can be found by solving $p + 1$ linear systems with the matrix $A$ (to obtain $A^{-1} d$
and $A^{-1} U$), inverting the $p \times p$ matrix $I + V^T A^{-1} U$, and performing some elementary
matrix algebra. Inversion of the $p \times p$ matrix $I + V^T A^{-1} U$ is inexpensive when $p \ll n$.
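
The rank-one case ($p = 1$) shows the mechanics most simply. In the sketch below, $A$ is taken to be diagonal so that solves with $A$ are trivial. This is an assumption made purely for illustration, and `solve_diag` stands in for whatever inexpensive solver is available for $A$. All names are hypothetical, and exact rational arithmetic is used so that the residual check is exact.

```python
from fractions import Fraction as F


def solve_diag(diag, rhs):
    """Stand-in for a cheap solve with A; here A is diagonal."""
    return [r / d for r, d in zip(rhs, diag)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve_rank_one_update(diag, a, b, d):
    """Solve (A + a b^T) x = d using two solves with A and formula (A.27)."""
    y = solve_diag(diag, d)      # y = A^{-1} d
    w = solve_diag(diag, a)      # w = A^{-1} a
    denom = 1 + dot(b, w)        # 1 + b^T A^{-1} a
    if denom == 0:
        raise ValueError("updated matrix is singular")
    scale = dot(b, y) / denom
    return [yi - scale * wi for yi, wi in zip(y, w)]


diag = [F(2), F(4), F(5)]
a = [F(1), F(2), F(0)]
b = [F(3), F(-1), F(1)]
d = [F(1), F(0), F(2)]
x = solve_rank_one_update(diag, a, b, d)

# Check by forming A + a b^T explicitly and computing the residual.
A_bar = [[(diag[i] if i == j else 0) + a[i] * b[j] for j in range(3)]
         for i in range(3)]
residual = [dot(A_bar[i], x) - d[i] for i in range(3)]
print(x, residual)
```

The updated matrix is never factored; only the two solves with $A$ and a scalar division are needed.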

INTERLACING  EIGENVALUE  THEOREM 

The  following  result  is  proved  for  example  in  Golub  and  Van  Loan  [136, 
Theorem  8.1.8]. 

Theorem A.1 (Interlacing Eigenvalue Theorem).

Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ satisfying

\[
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n,
\]

and let $z \in \mathbb{R}^n$ be a vector with $\|z\| = 1$, and $a \in \mathbb{R}$ be a scalar. Then if we denote the
eigenvalues of $A + a z z^T$ by $\xi_1, \xi_2, \dots, \xi_n$ (in decreasing order), we have for $a > 0$ that

\[
\xi_1 \ge \lambda_1 \ge \xi_2 \ge \lambda_2 \ge \xi_3 \ge \cdots \ge \xi_n \ge \lambda_n,
\]

with

\[
\sum_{i=1}^{n} (\xi_i - \lambda_i) = a. \tag{A.29}
\]

If $a < 0$, we have that

\[
\lambda_1 \ge \xi_1 \ge \lambda_2 \ge \xi_2 \ge \lambda_3 \ge \cdots \ge \lambda_n \ge \xi_n,
\]

where the relationship (A.29) is again satisfied.

Informally stated, the eigenvalues of the modified matrix “interlace” the eigenvalues of the
original matrix, with nonnegative adjustments if the coefficient $a$ is positive, and nonpositive
adjustments if $a$ is negative. The total magnitude of the adjustments equals $a$, whose
magnitude is identical to the Euclidean norm $\|a z z^T\|_2$ of the modification.
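
The theorem can be observed on a $2 \times 2$ example, where the eigenvalues of a symmetric matrix have a closed form. The sketch below is illustrative only. The matrix, the unit vector `z`, and the helper names are arbitrary choices, and the tolerance guards against roundoff in the comparisons.

```python
import math


def eig_sym_2x2(m):
    """Eigenvalues (largest first) of a symmetric 2x2 matrix m."""
    mean = (m[0][0] + m[1][1]) / 2.0
    radius = math.hypot((m[0][0] - m[1][1]) / 2.0, m[0][1])
    return mean + radius, mean - radius


def rank_one_update(m, a, z):
    """Return m + a * z z^T for a 2-vector z."""
    return [[m[i][j] + a * z[i] * z[j] for j in range(2)] for i in range(2)]


def interlaces(lam, xi, a, tol=1e-12):
    """Check the ordering stated in Theorem A.1 for n = 2."""
    if a > 0:
        chain = [xi[0], lam[0], xi[1], lam[1]]
    else:
        chain = [lam[0], xi[0], lam[1], xi[1]]
    return all(chain[i] >= chain[i + 1] - tol for i in range(3))


A = [[2.0, 1.0], [1.0, -1.0]]
z = [0.6, 0.8]                       # unit vector: 0.36 + 0.64 = 1
lam = eig_sym_2x2(A)

results = {}
for a in (1.5, -1.5):
    xi = eig_sym_2x2(rank_one_update(A, a, z))
    shift = sum(xi) - sum(lam)       # left-hand side of (A.29)
    results[a] = (interlaces(lam, xi, a),
                  math.isclose(shift, a, abs_tol=1e-12))
    print(a, xi, results[a])
```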

ERROR  ANALYSIS  AND  FLOATING-POINT  ARITHMETIC 

In most of this book our algorithms and analysis deal with real numbers. Modern
digital computers, however, cannot store or compute with general real numbers. Instead,
they work with a subset known as floating-point numbers. Any quantities that are stored
on the computer, whether they are read directly from a file or program or arise as the
intermediate result of a computation, must be approximated by a floating-point number.
In general, then, the numbers that are produced by practical computation differ from those
that would be produced if the arithmetic were exact. Of course, we try to perform our
computations in such a way that these differences are as tiny as possible.

Discussion of errors requires us to distinguish between absolute error and relative
error. If $x$ is some exact quantity (scalar, vector, matrix) and $\tilde{x}$ is its approximate value, the
absolute error is the norm of the difference, namely, $\|x - \tilde{x}\|$. (In general, any of the norms
(A.2a), (A.2b), and (A.2c) can be used in this definition.) The relative error is the ratio of
the absolute error to the size of the exact quantity, that is,

\[
\frac{\|x - \tilde{x}\|}{\|x\|}.
\]

When this ratio is significantly less than one, we can replace the denominator by the size of
the approximate quantity (that is, $\|\tilde{x}\|$) without affecting its value very much.

Most computations associated with optimization algorithms are performed in double-precision
arithmetic. Double-precision numbers are stored in words of length 64 bits. Most
of these bits (say $t$) are devoted to storing the fractional part, while the remainder encode
the exponent $e$ and other information, such as the sign of the number, or an indication of
whether it is zero or “undefined.” Typically, the fractional part has the form

\[
.d_1 d_2 \dots d_t,
\]

where each $d_i$, $i = 1, 2, \dots, t$, is either zero or one. (In some systems $d_1$ is implicitly assumed
to be 1 and is not stored.) The value of the floating-point number is then

\[
\sum_{i=1}^{t} d_i 2^{-i} \times 2^{e}.
\]

The value $2^{-t-1}$ is known as unit roundoff and is denoted by $\mathbf{u}$. Any real number whose
absolute value lies in the range $[2^L, 2^U]$ (where $L$ and $U$ are lower and upper bounds on
the value of the exponent $e$) can be approximated to within a relative accuracy of $\mathbf{u}$ by a
floating-point number, that is,

\[
\mathrm{fl}(x) = x(1 + \epsilon), \quad \text{where } |\epsilon| \le \mathbf{u}, \tag{A.30}
\]

where $\mathrm{fl}(\cdot)$ denotes floating-point approximation. The value of $\mathbf{u}$ for double-precision IEEE
arithmetic is about $1.1 \times 10^{-16}$. In other words, if the real number $x$ and its floating-point
approximation are both written as base-10 numbers (the usual fashion), they agree to at
least 15 digits.

For further information on floating-point computations, see Overton [233], Golub
and Van Loan [136, Section 2.4], and Higham [169].
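
The value of unit roundoff quoted above, and the bound (A.30), can be checked directly in Python. The sketch assumes that Python floats are IEEE double-precision numbers, which holds on all common platforms but is an assumption nonetheless. `Fraction` is used to obtain the exact value of the stored number.

```python
import sys
from fractions import Fraction

# Assumption: Python floats are IEEE double precision (53-bit significand).
eps = sys.float_info.epsilon   # gap between 1.0 and the next larger float
u = eps / 2                    # unit roundoff

print(u)                       # about 1.1e-16
print(1.0 + u == 1.0)          # adding u to 1 is lost in rounding
print(1.0 + eps == 1.0)        # adding the full gap is not

# Check (A.30) for x = 1/10, which has no finite binary expansion.
x = Fraction(1, 10)
fl_x = Fraction(0.1)           # exact rational value of the stored float
rel_err = abs(fl_x - x) / x
print(float(rel_err), rel_err <= Fraction(u))
```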

When an arithmetic operation is performed with one or two floating-point numbers,
the result must also be stored as a floating-point number. This process introduces a small
roundoff error, whose size can be quantified in terms of the size of the arguments. If $x$ and $y$
are two floating-point numbers, we have that

\[
|\mathrm{fl}(x * y) - x * y| \le \mathbf{u} |x * y|, \tag{A.31}
\]

where $*$ denotes any of the operations $+$, $-$, $\times$, $\div$.

Although the error in a single floating-point operation appears benign, more significant
errors may occur when the arguments $x$ and $y$ are floating-point approximations of two
real numbers, or when a sequence of computations is performed in succession. Suppose,
for instance, that $x$ and $y$ are large real numbers whose values are very similar. When we
store them in a computer, we approximate them with floating-point numbers $\mathrm{fl}(x)$ and $\mathrm{fl}(y)$
that satisfy

\[
\mathrm{fl}(x) = x + \epsilon_x, \quad \mathrm{fl}(y) = y + \epsilon_y, \quad \text{where } |\epsilon_x| \le \mathbf{u}|x|, \; |\epsilon_y| \le \mathbf{u}|y|.
\]

If we take the difference of the two stored numbers, we obtain a final result $\mathrm{fl}(\mathrm{fl}(x) - \mathrm{fl}(y))$
that satisfies

\[
\mathrm{fl}(\mathrm{fl}(x) - \mathrm{fl}(y)) = (\mathrm{fl}(x) - \mathrm{fl}(y))(1 + \epsilon_{xy}), \quad \text{where } |\epsilon_{xy}| \le \mathbf{u}.
\]

By combining these expressions, we find that the difference between this result and the true
value $x - y$ may be as large in magnitude as

\[
|\epsilon_x| + |\epsilon_y| + |\epsilon_{xy}| \, |x - y|
\]

(neglecting products of the small quantities), which is bounded by $\mathbf{u}(|x| + |y| + |x - y|)$.
Hence, since $x$ and $y$ are large and close together, the relative error is approximately
$2\mathbf{u}|x| / |x - y|$, which may be quite large, since $|x| \gg |x - y|$.
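
The bound just derived can be observed on a concrete pair of numbers. In the sketch below, which again assumes IEEE double-precision floats, the two real numbers are held exactly as rationals. They are then rounded to floats and subtracted, and the exact error of the result is compared with the bound. The particular values of `x` and `y` are arbitrary choices for illustration.

```python
import sys
from fractions import Fraction

u = Fraction(sys.float_info.epsilon) / 2      # unit roundoff, held exactly

x = Fraction(300000001, 3)    # about 1.0e8, not exactly representable
y = Fraction(299999999, 3)    # a nearby value; the true difference is 2/3

computed = float(x) - float(y)                # fl(fl(x) - fl(y))
abs_err = abs(Fraction(computed) - (x - y))   # exact error of the result
rel_err = abs_err / abs(x - y)

bound = u * (abs(x) + abs(y) + abs(x - y))    # the bound derived above
print("relative error of the difference:", float(rel_err))
print("unit roundoff:                   ", float(u))
print("relative form of the bound:      ", float(bound / abs(x - y)))
```

Each stored number has relative error at most $\mathbf{u}$, yet the relative error of their difference is larger by many orders of magnitude, in line with the factor $|x| / |x - y|$.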

This phenomenon is known as cancellation. It can also be explained (less formally)
by noting that if both $x$ and $y$ are accurate to $k$ digits, and if they agree in the first $\bar{k}$
digits, then their difference will contain only about $k - \bar{k}$ significant digits, since the first $\bar{k}$ digits
cancel each other out. This observation is the reason for the well-known adage of numerical
computing that one should avoid taking the difference of two similar numbers if at all
possible.

CONDITIONING AND STABILITY

Conditioning  and  stability  are  two  terms  that  are  used  frequently  in  connection  with 
numerical  computations.  Unfortunately,  their  meaning  sometimes  varies  from  author  to 
author,  but  the  general  definitions  below  are  widely  accepted,  and  we  adhere  to  them  in  this 
book. 

Conditioning  is  a  property  of  the  numerical  problem  at  hand  (whether  it  is  a  linear 
algebra  problem,  an  optimization  problem,  a  differential  equations  problem,  or  whatever). 
A  problem  is  said  to  be  well  conditioned  if  its  solution  is  not  affected  greatly  by  small 
perturbations  to  the  data  that  define  the  problem.  Otherwise,  it  is  said  to  be  ill  conditioned. 

A simple example is given by the following $2 \times 2$ system of linear equations:

\[
\begin{bmatrix} 1 & 2 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 3 \\ 2 \end{bmatrix}.
\]

By computing the inverse of the coefficient matrix, we find that the solution is simply

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} -1 & 2 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 3 \\ 2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \end{bmatrix}.
\]

If we replace the first right-hand-side element by 3.00001, the solution becomes $(x_1, x_2)^T =
(0.99999, 1.00001)^T$, which is only slightly different from its exact value $(1, 1)^T$. We would
note similar insensitivity if we were to perturb the other elements of the right-hand-side or
elements of the coefficient matrix. We conclude that this problem is well conditioned. On
the other hand, the problem

\[
\begin{bmatrix} 1.00001 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 2.00001 \\ 2 \end{bmatrix}
\]

is ill conditioned. Its exact solution is $x = (1, 1)^T$, but if we change the first element of the
right-hand-side from 2.00001 to 2, the solution would change drastically to $x = (0, 2)^T$.
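
Both examples can be reproduced exactly with rational arithmetic, which removes roundoff from the picture and shows that the sensitivity belongs to the problem itself. The sketch below uses Cramer's rule for the $2 \times 2$ solves, and the function name `solve_2x2` is hypothetical.

```python
from fractions import Fraction as F


def solve_2x2(A, b):
    """Exact solution of a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x2 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [x1, x2]


delta = F(1, 100000)                      # the perturbation 0.00001

well = [[F(1), F(2)], [F(1), F(1)]]
well_exact = solve_2x2(well, [F(3), F(2)])
well_perturbed = solve_2x2(well, [F(3) + delta, F(2)])

ill = [[F(1) + delta, F(1)], [F(1), F(1)]]
ill_exact = solve_2x2(ill, [F(2) + delta, F(2)])
ill_perturbed = solve_2x2(ill, [F(2), F(2)])

print([float(v) for v in well_exact], [float(v) for v in well_perturbed])
print([float(v) for v in ill_exact], [float(v) for v in ill_perturbed])
```

The same perturbation of size $10^{-5}$ moves the first solution by about $10^{-5}$ in each component, and the second by a full unit.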

For general square linear systems $Ax = b$ where $A \in \mathbb{R}^{n \times n}$, the condition number of
the matrix (defined in (A.11)) can be used to quantify the conditioning. Specifically, if we
perturb $A$ to $\tilde{A}$ and $b$ to $\tilde{b}$ and take $\tilde{x}$ to be the solution of the perturbed system $\tilde{A}\tilde{x} = \tilde{b}$, it
can be shown that

\[
\frac{\|x - \tilde{x}\|}{\|x\|} \approx \kappa(A) \left[ \frac{\|A - \tilde{A}\|}{\|A\|} + \frac{\|b - \tilde{b}\|}{\|b\|} \right]
\]

(see, for instance, Golub and Van Loan [136, Section 2.7]). Hence, a large condition number
$\kappa(A)$ indicates that the problem $Ax = b$ is ill conditioned, while a modest value indicates
well conditioning.

Note that the concept of conditioning has nothing to do with the particular algorithm
that is used to solve the problem, only with the numerical problem itself.

Stability, on the other hand, is a property of the algorithm. An algorithm is stable if it is guaranteed to produce accurate answers to all well-conditioned problems in its class, even when floating-point arithmetic is used.

As an example, consider again the linear equations $Ax = b$. We can show that Algorithm A.1, in combination with triangular substitution, yields a computed solution $\tilde{x}$ whose relative error is approximately

$$\frac{\|x - \tilde{x}\|}{\|x\|} \approx \kappa(A) \, \frac{\mathrm{growth}(A)}{\|A\|} \, \mathbf{u}, \tag{A.32}$$

where $\mathrm{growth}(A)$ is the size of the largest element that arises in $A$ during execution of Algorithm A.1. In the worst case, we can show that $\mathrm{growth}(A)/\|A\|$ may be around $2^{n-1}$, which indicates that Algorithm A.1 is an unstable algorithm, since even for modest $n$ (say, $n = 200$), the right-hand-side of (A.32) may be large even when $\kappa(A)$ is modest. In practice, however, large growth factors are rarely observed, so we conclude that Algorithm A.1 is stable for all practical purposes.

Gaussian elimination without pivoting, on the other hand, is definitely unstable. If we omit the possible exchange of rows in Algorithm A.1, the algorithm will fail to produce a factorization even of some well-conditioned matrices, such as

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix}.$$
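The effect of omitting row exchanges can be sketched on 2 x 2 systems. The code below is an illustration under stated assumptions, not Algorithm A.1 itself: the function names, the right-hand sides, and the second matrix (with a tiny pivot of 1e-20) are hypothetical choices made only to show the behavior.

```python
def solve_no_pivot(a, b):
    """Gaussian elimination on a 2x2 system without row exchanges."""
    (a11, a12), (a21, a22) = a
    m = a21 / a11                 # multiplier; breaks down when a11 == 0
    u22 = a22 - m * a12
    y2 = b[1] - m * b[0]
    x2 = y2 / u22
    x1 = (b[0] - a12 * x2) / a11
    return x1, x2

def solve_partial_pivot(a, b):
    """Same elimination, after swapping rows if that gives a larger pivot."""
    if abs(a[1][0]) > abs(a[0][0]):
        a = [a[1], a[0]]
        b = [b[1], b[0]]
    return solve_no_pivot(a, b)

# The matrix from the text: the zero pivot makes elimination fail outright.
A = [[0.0, 1.0], [1.0, 2.0]]
b = [1.0, 3.0]                    # chosen so that the solution is (1, 1)
try:
    solve_no_pivot(A, b)
    failed = False
except ZeroDivisionError:
    failed = True
x_pivot = solve_partial_pivot(A, b)

# A tiny (nonzero) pivot: no failure, but the computed answer is wrong.
A_tiny = [[1e-20, 1.0], [1.0, 1.0]]
b_tiny = [1.0, 2.0]               # true solution is very close to (1, 1)
x_bad = solve_no_pivot(A_tiny, b_tiny)
x_good = solve_partial_pivot(A_tiny, b_tiny)
```

In double precision the unpivoted solve of the second system returns a first component of 0 instead of a value near 1, even though the matrix is well conditioned; the row exchange removes the problem.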


For systems $Ax = b$ in which $A$ is symmetric positive definite, the Cholesky factorization in combination with triangular substitution constitutes a stable algorithm for producing a solution $x$.


A.2  ELEMENTS  OF  ANALYSIS,  GEOMETRY,  TOPOLOGY 

SEQUENCES 

Suppose that $\{x_k\}$ is a sequence of points belonging to $\mathbb{R}^n$. We say that a sequence $\{x_k\}$ converges to some point $x$, written $\lim_{k \to \infty} x_k = x$, if for any $\epsilon > 0$, there is an index $K$ such that

$$\|x_k - x\| \le \epsilon, \quad \text{for all } k \ge K.$$

For example, the sequence $\{x_k\}$ defined by $x_k = (1 - 2^{-k}, 1/k^2)^T$ converges to $(1, 0)^T$.

Given an index set $\mathcal{S} \subset \{1, 2, 3, \ldots\}$, we can define a subsequence of $\{x_k\}$ corresponding to $\mathcal{S}$, and denote it by $\{x_k\}_{k \in \mathcal{S}}$.

We say that $\hat{x} \in \mathbb{R}^n$ is an accumulation point or limit point for $\{x_k\}$ if there is an infinite set of indices $k_1, k_2, k_3, \ldots$ such that the subsequence $\{x_{k_i}\}_{i=1,2,3,\ldots}$ converges to $\hat{x}$; that is,

$$\lim_{i \to \infty} x_{k_i} = \hat{x}.$$

Alternatively, we say that for any $\epsilon > 0$ and all positive integers $K$, we have

$$\|x_k - \hat{x}\| \le \epsilon, \quad \text{for some } k \ge K.$$


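An example is given by the sequence

$$\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1/4 \\ 1/4 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1/8 \\ 1/8 \end{bmatrix}, \ldots, \tag{A.33}$$

which has exactly two limit points: $\hat{x} = (0, 0)^T$ and $\hat{x} = (1, 1)^T$. A sequence can even have an infinite number of limit points. An example is the sequence $x_k = \sin k$, for which every point in the interval $[-1, 1]$ is a limit point. A bounded sequence converges if and only if it has exactly one limit point.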

A sequence is said to be a Cauchy sequence if for any $\epsilon > 0$, there exists an integer $K$ such that $\|x_k - x_l\| \le \epsilon$ for all indices $k \ge K$ and $l \ge K$. A sequence converges if and only if it is a Cauchy sequence.

We now consider scalar sequences $\{t_k\}$, that is, $t_k \in \mathbb{R}$ for all $k$. This sequence is said to be bounded above if there exists a scalar $u$ such that $t_k \le u$ for all $k$, and bounded below if there is a scalar $v$ with $t_k \ge v$ for all $k$. The sequence $\{t_k\}$ is said to be nondecreasing if $t_{k+1} \ge t_k$ for all $k$, and nonincreasing if $t_{k+1} \le t_k$ for all $k$. If $\{t_k\}$ is nondecreasing and bounded above, then it converges, that is, $\lim_{k \to \infty} t_k = t$ for some scalar $t$. Similarly, if $\{t_k\}$ is nonincreasing and bounded below, it converges.

We define the supremum of the scalar sequence $\{t_k\}$ as the smallest real number $u$ such that $t_k \le u$ for all $k = 1, 2, 3, \ldots$, and denote it by $\sup\{t_k\}$. The infimum, denoted by $\inf\{t_k\}$, is the largest real number $v$ such that $v \le t_k$ for all $k = 1, 2, 3, \ldots$. We can now define the sequence of suprema as $\{u_i\}$, where

$$u_i \stackrel{\rm def}{=} \sup\{t_k \mid k \ge i\}.$$

Clearly, $\{u_i\}$ is a nonincreasing sequence. If bounded below, it converges to a finite number $\bar{u}$, which we call the "lim sup" of $\{t_k\}$, denoted by $\limsup t_k$. Similarly, we can denote the sequence of infima by $\{v_i\}$, where

$$v_i \stackrel{\rm def}{=} \inf\{t_k \mid k \ge i\},$$

which is nondecreasing. If $\{v_i\}$ is bounded above, it converges to a point $\bar{v}$ which we call the "lim inf" of $\{t_k\}$, denoted by $\liminf t_k$. As an example, the sequence $1, \tfrac12, 1, \tfrac14, 1, \tfrac18, \ldots$ has a lim inf of 0 and a lim sup of 1.

RATES  OF  CONVERGENCE 

One  of  the  key  measures  of  performance  of  an  algorithm  is  its  rate  of  convergence. 
Here,  we  define  the  terminology  associated  with  different  types  of  convergence. 

Let $\{x_k\}$ be a sequence in $\mathbb{R}^n$ that converges to $x^*$. We say that the convergence is Q-linear if there is a constant $r \in (0, 1)$ such that

$$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} \le r, \quad \text{for all } k \text{ sufficiently large.} \tag{A.34}$$

This means that the distance to the solution $x^*$ decreases at each iteration by at least a constant factor bounded away from 1. For example, the sequence $1 + (0.5)^k$ converges Q-linearly to 1, with rate $r = 0.5$. The prefix "Q" stands for "quotient," because this type of convergence is defined in terms of the quotient of successive errors.

The convergence is said to be Q-superlinear if

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0.$$

For example, the sequence $1 + k^{-k}$ converges superlinearly to 1. (Prove this statement!) Q-quadratic convergence, an even more rapid convergence rate, is obtained if

$$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} \le M, \quad \text{for all } k \text{ sufficiently large,}$$

where $M$ is a positive constant, not necessarily less than 1. An example is the sequence $1 + (0.5)^{2^k}$.

The speed of convergence depends on $r$ and (more weakly) on $M$, whose values depend not only on the algorithm but also on the properties of the particular problem. Regardless of these values, however, a quadratically convergent sequence will always eventually converge faster than a linearly convergent sequence.
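The three example sequences can be compared by tabulating the quotients that appear in the definitions. This is a minimal sketch: the helper `error_ratios` is a hypothetical name, and the errors $\|x_k - x^*\|$ are generated directly (rather than by subtracting 1 from $x_k$) to avoid floating-point cancellation.

```python
def error_ratios(errors, power=1):
    """Return e_{k+1} / e_k**power for consecutive errors e_k = |x_k - x*|."""
    return [errors[k + 1] / errors[k] ** power for k in range(len(errors) - 1)]

linear_err = [0.5 ** k for k in range(1, 8)]               # x_k = 1 + 0.5^k
superlinear_err = [float(k) ** (-k) for k in range(1, 8)]  # x_k = 1 + k^(-k)
quadratic_err = [0.5 ** (2 ** k) for k in range(1, 6)]     # x_k = 1 + 0.5^(2^k)

linear_q = error_ratios(linear_err)                  # stays at r = 0.5
superlinear_q = error_ratios(superlinear_err)        # tends to zero
quadratic_q = error_ratios(quadratic_err, power=2)   # bounded by M = 1

for name, q in [("Q-linear", linear_q),
                ("Q-superlinear", superlinear_q),
                ("Q-quadratic", quadratic_q)]:
    print(name, q)
```

The first list is constant at 0.5, the second decreases toward zero, and the third (with the squared denominator) is constant at 1, matching the three definitions.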

Obviously, any sequence that converges Q-quadratically also converges Q-superlinearly, and any sequence that converges Q-superlinearly also converges Q-linearly. We can also define higher rates of convergence (cubic, quartic, and so on), but these are less interesting in practical terms. In general, we say that the Q-order of convergence is $p$ (with $p > 1$) if there is a positive constant $M$ such that

$$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} \le M, \quad \text{for all } k \text{ sufficiently large.}$$

Quasi-Newton methods for unconstrained optimization typically converge Q-superlinearly, whereas Newton's method converges Q-quadratically under appropriate assumptions. In contrast, steepest descent algorithms converge only at a Q-linear rate, and when the problem is ill-conditioned the convergence constant $r$ in (A.34) is close to 1.

In  the  book,  we  omit  the  letter  Q  and  simply  talk  about  superlinear  convergence, 
quadratic  convergence,  and  so  on. 

A slightly weaker form of convergence, characterized by the prefix "R" (for "root"), is concerned with the overall rate of decrease in the error, rather than the decrease over each individual step of the algorithm. We say that convergence is R-linear if there is a sequence of nonnegative scalars $\{\nu_k\}$ such that

$$\|x_k - x^*\| \le \nu_k \ \text{ for all } k, \text{ and } \{\nu_k\} \text{ converges Q-linearly to zero.}$$

The sequence $\{\|x_k - x^*\|\}$ is said to be dominated by $\{\nu_k\}$. For instance, the sequence

$$x_k = \begin{cases} 1 + (0.5)^k, & k \text{ even}, \\ 1, & k \text{ odd}, \end{cases} \tag{A.35}$$

(the first few iterates are 2, 1, 1.25, 1, 1.0625, 1, ...) converges R-linearly to 1, because we have $|(1 + (0.5)^k) - 1| = (0.5)^k$, and the sequence $\{(0.5)^k\}$ converges Q-linearly to zero. Likewise, we say that $\{x_k\}$ converges R-superlinearly to $x^*$ if $\{\|x_k - x^*\|\}$ is dominated by a sequence of scalars converging Q-superlinearly to zero, and $\{x_k\}$ converges R-quadratically to $x^*$ if $\{\|x_k - x^*\|\}$ is dominated by a sequence converging Q-quadratically to zero.

Note  that  in  the  R-linear  sequence  (A.35),  the  error  actually  increases  at  every  second 
iteration!  Such  behavior  occurs  even  in  sequences  whose  R-rate  of  convergence  is  arbitrarily 
high,  but  it  cannot  occur  for  Q-linear  sequences,  which  insist  on  a  decrease  at  every  step  k, 
for  k  sufficiently  large. 
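The behavior of the sequence (A.35) can be checked directly. The sketch below uses hypothetical names (`x`, `errors`, `bounds`) and simply lists the errors next to the dominating sequence $(0.5)^k$.

```python
def x(k):
    """Sequence (A.35): 1 + 0.5^k for even k, 1 for odd k."""
    return 1.0 + 0.5 ** k if k % 2 == 0 else 1.0

errors = [abs(x(k) - 1.0) for k in range(8)]
bounds = [0.5 ** k for k in range(8)]      # dominating sequence nu_k

# R-linear: every error is bounded by the Q-linearly convergent bound.
dominated = all(e <= v for e, v in zip(errors, bounds))

# The error goes up again whenever k moves from odd to even.
increases = [errors[k + 1] > errors[k] for k in range(len(errors) - 1)]

print(errors)
print(dominated, increases)
```

Because the error is exactly zero at odd $k$ and positive at the next even $k$, no quotient bound of the form (A.34) can hold, which is why the sequence is R-linear but not Q-linear.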

For  an  extensive  discussion  of  convergence  rates  see  Ortega  and  Rheinboldt  [230]. 


TOPOLOGY  OF  THE  EUCLIDEAN  SPACE  R" 

The set $\mathcal{F}$ is bounded if there is some real number $M > 0$ such that

$$\|x\| \le M, \quad \text{for all } x \in \mathcal{F}.$$

A subset $\mathcal{F} \subset \mathbb{R}^n$ is open if for every $x \in \mathcal{F}$, we can find a positive number $\epsilon > 0$ such that the ball of radius $\epsilon$ around $x$ is contained in $\mathcal{F}$; that is,

$$\{y \in \mathbb{R}^n \mid \|y - x\| \le \epsilon\} \subset \mathcal{F}.$$

The set $\mathcal{F}$ is closed if for all possible sequences of points $\{x_k\}$ in $\mathcal{F}$, all limit points of $\{x_k\}$ are elements of $\mathcal{F}$. For instance, the set $\mathcal{F} = (0, 1) \cup (2, 10)$ is an open subset of $\mathbb{R}$, while $\mathcal{F} = [0, 1] \cup [2, 5]$ is a closed subset of $\mathbb{R}$. The set $\mathcal{F} = (0, 1]$ is a subset of $\mathbb{R}$ that is neither open nor closed.

The interior of a set $\mathcal{F}$, denoted by $\mathrm{int}\,\mathcal{F}$, is the largest open set contained in $\mathcal{F}$. The closure of $\mathcal{F}$, denoted by $\mathrm{cl}\,\mathcal{F}$, is the smallest closed set containing $\mathcal{F}$. In other words, we have

$$x \in \mathrm{cl}\,\mathcal{F} \quad \text{if } \lim_{k \to \infty} x_k = x \text{ for some sequence } \{x_k\} \text{ of points in } \mathcal{F}.$$

If $\mathcal{F} = (-1, 1] \cup [2, 4)$, then

$$\mathrm{cl}\,\mathcal{F} = [-1, 1] \cup [2, 4], \qquad \mathrm{int}\,\mathcal{F} = (-1, 1) \cup (2, 4).$$

Note that if $\mathcal{F}$ is open, then $\mathrm{int}\,\mathcal{F} = \mathcal{F}$, while if $\mathcal{F}$ is closed, then $\mathrm{cl}\,\mathcal{F} = \mathcal{F}$.

We note the following facts about open and closed sets. The union of finitely many closed sets is closed, while any intersection of closed sets is closed. The intersection of finitely many open sets is open, while any union of open sets is open.

The set $\mathcal{F}$ is compact if every sequence $\{x_k\}$ of points in $\mathcal{F}$ has at least one limit point, and all such limit points are in $\mathcal{F}$. (This definition is equivalent to the more formal one involving covers of $\mathcal{F}$.) The following is a central result in topology:

$$\mathcal{F} \subset \mathbb{R}^n \text{ is closed and bounded} \ \Rightarrow \ \mathcal{F} \text{ is compact.}$$

Given a point $x \in \mathbb{R}^n$, we call $\mathcal{N} \subset \mathbb{R}^n$ a neighborhood of $x$ if it is an open set containing $x$. An especially useful neighborhood is the open ball of radius $\epsilon$ around $x$, which is denoted by $\mathbb{B}(x, \epsilon)$; that is,

$$\mathbb{B}(x, \epsilon) = \{y \mid \|y - x\| < \epsilon\}.$$

Given a set $\mathcal{F} \subset \mathbb{R}^n$, we say that $\mathcal{N}$ is a neighborhood of $\mathcal{F}$ if there is $\epsilon > 0$ such that

$$\bigcup_{x \in \mathcal{F}} \mathbb{B}(x, \epsilon) \subset \mathcal{N}.$$


CONVEX  SETS  IN  R" 


A convex combination of a finite set of vectors $\{x_1, x_2, \ldots, x_m\}$ in $\mathbb{R}^n$ is any vector $x$ of the form

$$x = \sum_{i=1}^m \alpha_i x_i, \quad \text{where } \sum_{i=1}^m \alpha_i = 1, \text{ and } \alpha_i \ge 0 \text{ for all } i = 1, 2, \ldots, m.$$

The convex hull of $\{x_1, x_2, \ldots, x_m\}$ is the set of all convex combinations of these vectors.

A cone is a set $\mathcal{F}$ with the property that for all $x \in \mathcal{F}$ we have

$$x \in \mathcal{F} \ \Rightarrow \ \alpha x \in \mathcal{F}, \quad \text{for all } \alpha > 0. \tag{A.36}$$

For instance, the set $\mathcal{F} \subset \mathbb{R}^2$ defined by

$$\{(x_1, x_2)^T \mid x_1 > 0, \ x_2 \ge 0\}$$

is a cone in $\mathbb{R}^2$. Note that cones are not necessarily convex. For example, the set $\{(x_1, x_2)^T \mid x_1 \ge 0 \text{ or } x_2 \ge 0\}$, which encompasses three quarters of the two-dimensional plane, is a cone.

The cone generated by $\{x_1, x_2, \ldots, x_m\}$ is the set of all vectors $x$ of the form

$$x = \sum_{i=1}^m \alpha_i x_i, \quad \text{where } \alpha_i \ge 0 \text{ for all } i = 1, 2, \ldots, m.$$

Note that all cones of this form are convex.

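Finally, we define the affine hull and relative interior of a set. An affine set in $\mathbb{R}^n$ is the set of all vectors $\{x\} \oplus \mathcal{S}$, where $x \in \mathbb{R}^n$ and $\mathcal{S}$ is a subspace of $\mathbb{R}^n$. Given $\mathcal{F} \subset \mathbb{R}^n$, the affine hull of $\mathcal{F}$ (denoted by $\mathrm{aff}\,\mathcal{F}$) is the smallest affine set containing $\mathcal{F}$. For instance, when $\mathcal{F}$ is the "ice-cream cone" defined in three dimensions as

$$\mathcal{F} = \left\{ x \in \mathbb{R}^3 \mid x_3 \ge 2\sqrt{x_1^2 + x_2^2} \right\} \tag{A.37}$$

(see Figure A.1), we have $\mathrm{aff}\,\mathcal{F} = \mathbb{R}^3$. If $\mathcal{F}$ is the set of two isolated points $\mathcal{F} = \{(1, 0, 0)^T, (0, 2, 0)^T\}$, we have

$$\mathrm{aff}\,\mathcal{F} = \{(1, 0, 0)^T + \alpha(-1, 2, 0)^T \mid \text{for all } \alpha \in \mathbb{R}\}.$$

Figure A.1 "Ice-cream cone" set.

The relative interior $\mathrm{ri}\,\mathcal{F}$ of the set $\mathcal{F}$ is its interior relative to $\mathrm{aff}\,\mathcal{F}$. If $x \in \mathcal{F}$, then $x \in \mathrm{ri}\,\mathcal{F}$ if there is an $\epsilon > 0$ such that

$$(x + \epsilon B) \cap \mathrm{aff}\,\mathcal{F} \subset \mathcal{F}.$$

Referring again to the ice-cream cone (A.37), we have that

$$\mathrm{ri}\,\mathcal{F} = \left\{ x \in \mathbb{R}^3 \mid x_3 > 2\sqrt{x_1^2 + x_2^2} \right\}.$$

For the set of two isolated points $\mathcal{F} = \{(1, 0, 0)^T, (0, 2, 0)^T\}$, we have $\mathrm{ri}\,\mathcal{F} = \emptyset$. For the set $\mathcal{F}$ defined by

$$\mathcal{F} = \{x \in \mathbb{R}^3 \mid x_1 \in [0, 1], \ x_2 \in [0, 1], \ x_3 = 0\},$$

we have that

$$\mathrm{aff}\,\mathcal{F} = \mathbb{R} \times \mathbb{R} \times \{0\}, \qquad \mathrm{ri}\,\mathcal{F} = \{x \in \mathbb{R}^3 \mid x_1 \in (0, 1), \ x_2 \in (0, 1), \ x_3 = 0\}.$$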

CONTINUITY  AND  LIMITS 

Let $f$ be a function that maps some domain $\mathcal{D} \subset \mathbb{R}^n$ to the space $\mathbb{R}^m$. For some point $x_0 \in \mathrm{cl}\,\mathcal{D}$, we write

$$\lim_{x \to x_0} f(x) = f_0 \tag{A.38}$$

(spoken "the limit of $f(x)$ as $x$ approaches $x_0$ is $f_0$") if for all $\epsilon > 0$, there is a value $\delta > 0$ such that

$$\|x - x_0\| < \delta \text{ and } x \in \mathcal{D} \ \Rightarrow \ \|f(x) - f_0\| < \epsilon.$$

We say that $f$ is continuous at $x_0$ if $x_0 \in \mathcal{D}$ and the expression (A.38) holds with $f_0 = f(x_0)$. We say that $f$ is continuous on its domain $\mathcal{D}$ if $f$ is continuous for all $x_0 \in \mathcal{D}$.

An example is provided by the function

$$f(x) = \begin{cases} -x & \text{if } x \in [-1, 1], \ x \ne 0, \\ 5 & \text{for all other } x \in [-10, 10]. \end{cases} \tag{A.39}$$

This function is defined on the domain $[-10, 10]$ and is continuous at all points of the domain except the points $x = 0$, $x = 1$, and $x = -1$. At $x = 0$, the expression (A.38) holds with $f_0 = 0$, but the function is not continuous at this point because $f_0 \ne f(0) = 5$. At $x = -1$, the limit (A.38) is not defined, because the function values in the neighborhood of this point are close to both 5 and 1, depending on whether $x$ is slightly smaller or slightly larger than $-1$. Hence, the function is certainly not continuous at this point. The same comments apply to the point $x = 1$, where the nearby values are close to both $-1$ and 5.

In the special case of $n = 1$ (that is, the argument of $f$ is a real scalar), we can also define the one-sided limit. Given $x_0 \in \mathrm{cl}\,\mathcal{D}$, we write

$$\lim_{x \downarrow x_0} f(x) = f_0 \tag{A.40}$$

(spoken "the limit of $f(x)$ as $x$ approaches $x_0$ from above is $f_0$") if for all $\epsilon > 0$, there is a value $\delta > 0$ such that

$$x_0 < x < x_0 + \delta \text{ and } x \in \mathcal{D} \ \Rightarrow \ \|f(x) - f_0\| < \epsilon.$$

Similarly, we write

$$\lim_{x \uparrow x_0} f(x) = f_0 \tag{A.41}$$

(spoken "the limit of $f(x)$ as $x$ approaches $x_0$ from below is $f_0$") if for all $\epsilon > 0$, there is a value $\delta > 0$ such that

$$x_0 - \delta < x < x_0 \text{ and } x \in \mathcal{D} \ \Rightarrow \ \|f(x) - f_0\| < \epsilon.$$

For the function defined in (A.39), we have that

$$\lim_{x \downarrow 1} f(x) = 5, \qquad \lim_{x \uparrow 1} f(x) = -1.$$

Consider again the general case of $f : \mathcal{D} \to \mathbb{R}^m$, where $\mathcal{D} \subset \mathbb{R}^n$ for general $m$ and $n$. The function $f$ is said to be Lipschitz continuous on some set $\mathcal{N} \subset \mathcal{D}$ if there is a constant $L > 0$ such that

$$\|f(x_1) - f(x_0)\| \le L \|x_1 - x_0\|, \quad \text{for all } x_0, x_1 \in \mathcal{N}. \tag{A.42}$$

($L$ is called the Lipschitz constant.) The function $f$ is locally Lipschitz continuous at a point $\bar{x} \in \mathrm{int}\,\mathcal{D}$ if there is some neighborhood $\mathcal{N}$ of $\bar{x}$ with $\mathcal{N} \subset \mathcal{D}$ such that the property (A.42) holds for some $L > 0$.

If $g$ and $h$ are two functions mapping $\mathcal{D} \subset \mathbb{R}^n$ to $\mathbb{R}^m$, Lipschitz continuous on a set $\mathcal{N} \subset \mathcal{D}$, their sum $g + h$ is also Lipschitz continuous, with Lipschitz constant equal to the sum of the Lipschitz constants for $g$ and $h$ individually. If $g$ and $h$ are two functions mapping $\mathcal{D} \subset \mathbb{R}^n$ to $\mathbb{R}$, the product $gh$ is Lipschitz continuous on a set $\mathcal{N} \subset \mathcal{D}$ if both $g$ and $h$ are Lipschitz continuous on $\mathcal{N}$ and both are bounded on $\mathcal{N}$ (that is, there is $M > 0$ such that $|g(x)| \le M$ and $|h(x)| \le M$ for all $x \in \mathcal{N}$). We prove this claim via a sequence of elementary inequalities, for arbitrary $x_0, x_1 \in \mathcal{N}$:

$$\begin{aligned} |g(x_0)h(x_0) - g(x_1)h(x_1)| &\le |g(x_0)h(x_0) - g(x_1)h(x_0)| + |g(x_1)h(x_0) - g(x_1)h(x_1)| \\ &= |h(x_0)| \, |g(x_0) - g(x_1)| + |g(x_1)| \, |h(x_0) - h(x_1)| \\ &\le 2ML\|x_0 - x_1\|, \end{aligned} \tag{A.43}$$

where $L$ is an upper bound on the Lipschitz constant for both $g$ and $h$.


DERIVATIVES 

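Let $\phi : \mathbb{R} \to \mathbb{R}$ be a real-valued function of a real variable (sometimes known as a univariate function). The first derivative $\phi'(\alpha)$ is defined by

$$\frac{d\phi}{d\alpha} = \phi'(\alpha) \stackrel{\rm def}{=} \lim_{\epsilon \to 0} \frac{\phi(\alpha + \epsilon) - \phi(\alpha)}{\epsilon}. \tag{A.44}$$

The second derivative is obtained by replacing $\phi$ with $\phi'$ in this same formula; that is,

$$\frac{d^2\phi}{d\alpha^2} = \phi''(\alpha) \stackrel{\rm def}{=} \lim_{\epsilon \to 0} \frac{\phi'(\alpha + \epsilon) - \phi'(\alpha)}{\epsilon}. \tag{A.45}$$

Suppose now that $\alpha$ in turn depends on another quantity $\beta$ (we denote this dependence by writing $\alpha = \alpha(\beta)$). We can use the chain rule to calculate the derivative of $\phi$ with respect to $\beta$:

$$\frac{d\phi(\alpha(\beta))}{d\beta} = \frac{d\phi}{d\alpha} \frac{d\alpha}{d\beta} = \phi'(\alpha)\alpha'(\beta). \tag{A.46}$$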


Consider now the function $f : \mathbb{R}^n \to \mathbb{R}$, which is a real-valued function of $n$ independent variables. We typically gather the variables into a vector $x = (x_1, x_2, \ldots, x_n)^T$. We say that $f$ is differentiable at $x$ if there exists a vector $g \in \mathbb{R}^n$ such that

$$\lim_{y \to 0} \frac{f(x + y) - f(x) - g^T y}{\|y\|} = 0, \tag{A.47}$$

where $\|\cdot\|$ is any vector norm of $y$. (This type of differentiability is known as Fréchet differentiability.) If $g$ satisfying (A.47) exists, we call it the gradient of $f$ at $x$, and denote it by $\nabla f(x)$, written componentwise as

$$\nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}. \tag{A.48}$$

Here, $\partial f / \partial x_i$ represents the partial derivative of $f$ with respect to $x_i$. By setting $y = \epsilon e_i$ in (A.47), where $e_i$ is the vector in $\mathbb{R}^n$ consisting of all zeros, except for a 1 in position $i$, we obtain

$$\frac{\partial f}{\partial x_i} \stackrel{\rm def}{=} \lim_{\epsilon \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + \epsilon, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{\epsilon} = \lim_{\epsilon \to 0} \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}.$$

A gradient with respect to only a subset of the unknowns can be expressed by means of a subscript on the symbol $\nabla$. Thus for the function of two vector variables $f(z, t)$, we use $\nabla_z f(z, t)$ to denote the gradient with respect to $z$ (holding $t$ constant).
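The limit defining $\partial f / \partial x_i$ suggests a simple numerical approximation: evaluate the difference quotient at a small but nonzero $\epsilon$. The sketch below is illustrative only; the function name `fd_gradient`, the step size `eps`, and the test function are assumptions, not taken from the text.

```python
def fd_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x by forward differences."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps                  # the point x + eps * e_i
        grad.append((f(shifted) - fx) / eps)
    return grad

def f(x):
    """Hypothetical test function f(x) = x_1^2 + 3 x_1 x_2."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]

point = [1.0, 2.0]
approx = fd_gradient(f, point)
exact = [2.0 * point[0] + 3.0 * point[1], 3.0 * point[0]]  # analytic gradient

print(approx, exact)
```

The approximation error is of the order of `eps` for a smooth function; making `eps` much smaller eventually hurts accuracy because of roundoff in the subtraction.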

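The matrix of second partial derivatives of $f$ is known as the Hessian, and is defined as

$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.$$

We say that $f$ is differentiable on a domain $\mathcal{D}$ if $\nabla f(x)$ exists for all $x \in \mathcal{D}$, and continuously differentiable if $\nabla f(x)$ is a continuous function of $x$. Similarly, $f$ is twice differentiable on $\mathcal{D}$ if $\nabla^2 f(x)$ exists for all $x \in \mathcal{D}$ and twice continuously differentiable if $\nabla^2 f(x)$ is continuous on $\mathcal{D}$. Note that when $f$ is twice continuously differentiable, the Hessian is a symmetric matrix, since

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}, \quad \text{for all } i, j = 1, 2, \ldots, n.$$

When $f$ is a vector-valued function, that is, $f : \mathbb{R}^n \to \mathbb{R}^m$ (see Chapters 10 and 11), we define $\nabla f(x)$ to be the $n \times m$ matrix whose $i$th column is $\nabla f_i(x)$, that is, the gradient of $f_i$ with respect to $x$. Often, for notational convenience, we prefer to work with the transpose of this matrix, which has dimensions $m \times n$. This matrix is called the Jacobian and is often denoted by $J(x)$. Specifically, the $(i, j)$ element of $J(x)$ is $\partial f_i(x) / \partial x_j$.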

When the vector $x$ in turn depends on another vector $t$ (that is, $x = x(t)$), we can extend the chain rule (A.46) for the univariate function. Defining

$$h(t) = f(x(t)), \tag{A.49}$$

we have

$$\nabla h(t) = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \nabla x_i(t) = \nabla x(t) \nabla f(x(t)). \tag{A.50}$$


□  Example  A.  1 

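Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x_1, x_2) = x_1^2 + x_1 x_2$, where $x_1 = \sin t_1 + t_2^2$ and $x_2 = (t_1 + t_2)^2$. Defining $h(t)$ as in (A.49), the chain rule (A.50) yields

$$\begin{aligned} \nabla h(t) &= \sum_{i=1}^2 \frac{\partial f}{\partial x_i} \nabla x_i(t) \\ &= (2x_1 + x_2) \begin{bmatrix} \cos t_1 \\ 2t_2 \end{bmatrix} + x_1 \begin{bmatrix} 2(t_1 + t_2) \\ 2(t_1 + t_2) \end{bmatrix} \\ &= \left( 2\left(\sin t_1 + t_2^2\right) + (t_1 + t_2)^2 \right) \begin{bmatrix} \cos t_1 \\ 2t_2 \end{bmatrix} + \left( \sin t_1 + t_2^2 \right) \begin{bmatrix} 2(t_1 + t_2) \\ 2(t_1 + t_2) \end{bmatrix}. \end{aligned}$$

If, on the other hand, we substitute directly for $x$ into the definition of $f$, we obtain

$$h(t) = f(x(t)) = \left( \sin t_1 + t_2^2 \right)^2 + \left( \sin t_1 + t_2^2 \right)(t_1 + t_2)^2.$$

The reader should verify that the gradient of this expression is identical to the one obtained above by applying the chain rule.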

Special cases of the chain rule can be derived when $x(t)$ in (A.50) is a linear function of $t$, say $x(t) = Ct$. We then have $\nabla x(t) = C^T$, so that

$$\nabla h(t) = C^T \nabla f(Ct).$$

In the case in which $f$ is a scalar function, we can differentiate twice using the chain rule to obtain

$$\nabla^2 h(t) = C^T \nabla^2 f(Ct) C.$$

(The proof of this statement is left as an exercise.)

DIRECTIONAL  DERIVATIVES 

The directional derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in the direction $p$ is given by

$$D(f(x); p) \stackrel{\rm def}{=} \lim_{\epsilon \to 0} \frac{f(x + \epsilon p) - f(x)}{\epsilon}. \tag{A.51}$$

The directional derivative may be well defined even when $f$ is not continuously differentiable; in fact, it is most useful in such situations. Consider for instance the $\ell_1$ norm function $f(x) = \|x\|_1$. We have from the definition (A.51) that

$$D(\|x\|_1; p) = \lim_{\epsilon \to 0} \frac{\|x + \epsilon p\|_1 - \|x\|_1}{\epsilon} = \lim_{\epsilon \to 0} \frac{\sum_{i=1}^n |x_i + \epsilon p_i| - \sum_{i=1}^n |x_i|}{\epsilon}.$$

If $x_i > 0$, we have $|x_i + \epsilon p_i| = |x_i| + \epsilon p_i$ for all $\epsilon$ sufficiently small. If $x_i < 0$, we have $|x_i + \epsilon p_i| = |x_i| - \epsilon p_i$, while if $x_i = 0$, we have $|x_i + \epsilon p_i| = \epsilon |p_i|$. Therefore, we have

$$D(\|x\|_1; p) = \sum_{i \mid x_i < 0} -p_i + \sum_{i \mid x_i > 0} p_i + \sum_{i \mid x_i = 0} |p_i|,$$

so the directional derivative of this function exists for any $x$ and $p$. The first derivative $\nabla f(x)$ does not exist, however, whenever any of the components of $x$ are zero.
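The closed-form expression for the directional derivative of the $\ell_1$ norm can be compared with a difference quotient. In this sketch the function names and the test vectors are hypothetical, and the quotient is evaluated at a small positive $\epsilon$ (a one-sided difference), which is the case covered by the identity $|x_i + \epsilon p_i| = \epsilon |p_i|$ for $x_i = 0$.

```python
def l1_norm(x):
    return sum(abs(xi) for xi in x)

def l1_directional_derivative(x, p):
    """Sum of -p_i (x_i < 0), p_i (x_i > 0), and |p_i| (x_i == 0)."""
    total = 0.0
    for xi, pi in zip(x, p):
        if xi > 0:
            total += pi
        elif xi < 0:
            total -= pi
        else:
            total += abs(pi)
    return total

def one_sided_difference(x, p, eps=1e-8):
    """Difference quotient of the l1 norm along p with eps > 0."""
    shifted = [xi + eps * pi for xi, pi in zip(x, p)]
    return (l1_norm(shifted) - l1_norm(x)) / eps

x = [1.0, -2.0, 0.0]
p = [0.5, 3.0, -4.0]
formula = l1_directional_derivative(x, p)   # 0.5 - 3.0 + 4.0
numeric = one_sided_difference(x, p)
print(formula, numeric)
```

The third component of `x` is zero, so the gradient does not exist at this point, yet both computations return a well-defined value.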

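When $f$ is in fact continuously differentiable in a neighborhood of $x$, we have

$$D(f(x); p) = \nabla f(x)^T p.$$

To verify this formula, we define the function

$$\phi(\alpha) = f(x + \alpha p) = f(y(\alpha)), \tag{A.52}$$

where $y(\alpha) = x + \alpha p$. Note that

$$\lim_{\epsilon \to 0} \frac{f(x + \epsilon p) - f(x)}{\epsilon} = \lim_{\epsilon \to 0} \frac{\phi(\epsilon) - \phi(0)}{\epsilon} = \phi'(0).$$

By applying the chain rule (A.50) to $f(y(\alpha))$, we obtain

$$\phi'(\alpha) = \sum_{i=1}^n \frac{\partial f(y(\alpha))}{\partial y_i} \nabla y_i(\alpha) = \sum_{i=1}^n \frac{\partial f(y(\alpha))}{\partial y_i} p_i = \nabla f(y(\alpha))^T p = \nabla f(x + \alpha p)^T p. \tag{A.53}$$

We obtain the formula $D(f(x); p) = \nabla f(x)^T p$ by setting $\alpha = 0$ in (A.53) and comparing the last two expressions with the definition (A.51).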


MEAN  VALUE  THEOREM 

We now recall the mean value theorem for univariate functions. Given a continuously differentiable function $\phi : \mathbb{R} \to \mathbb{R}$ and two real numbers $\alpha_0$ and $\alpha_1$ that satisfy $\alpha_1 > \alpha_0$, we have that

$$\phi(\alpha_1) = \phi(\alpha_0) + \phi'(\xi)(\alpha_1 - \alpha_0) \tag{A.54}$$

for some $\xi \in (\alpha_0, \alpha_1)$. An extension of this result to a multivariate function $f : \mathbb{R}^n \to \mathbb{R}$ is that for any vector $p$ we have

$$f(x + p) = f(x) + \nabla f(x + \alpha p)^T p, \tag{A.55}$$

for some $\alpha \in (0, 1)$. (This result can be proved by defining $\phi(\alpha) = f(x + \alpha p)$, $\alpha_0 = 0$, and $\alpha_1 = 1$ and applying the chain rule, as above.)


□  Example  A. 2 

Consider $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x) = x_1^3 + 3x_1 x_2^2$, and let $x = (0, 0)^T$ and $p = (1, 2)^T$. It is easy to verify that $f(x) = 0$ and $f(x + p) = 13$. Since

$$\nabla f(x + \alpha p) = \begin{bmatrix} 3(x_1 + \alpha p_1)^2 + 3(x_2 + \alpha p_2)^2 \\ 6(x_1 + \alpha p_1)(x_2 + \alpha p_2) \end{bmatrix} = \begin{bmatrix} 15\alpha^2 \\ 12\alpha^2 \end{bmatrix},$$

we have that $\nabla f(x + \alpha p)^T p = 39\alpha^2$. Hence the relation (A.55) holds when we set $\alpha = 1/\sqrt{3}$, which lies in the open interval $(0, 1)$, as claimed.

An alternative expression to (A.55) can be stated for twice differentiable functions: We have

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + \alpha p) p, \tag{A.56}$$

for some $\alpha \in (0, 1)$. In fact, this expression is one form of Taylor's theorem, Theorem 2.1 in Chapter 2, to which we refer throughout the book.
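The mean value relation (A.55) for the function of Example A.2 can be checked with a short computation. The sketch below evaluates both sides at $\alpha = 1/\sqrt{3}$, the value for which $39\alpha^2 = 13$; the function and variable names are illustrative.

```python
import math

def f(x1, x2):
    return x1 ** 3 + 3.0 * x1 * x2 ** 2

def grad_f(x1, x2):
    return (3.0 * x1 ** 2 + 3.0 * x2 ** 2, 6.0 * x1 * x2)

x = (0.0, 0.0)
p = (1.0, 2.0)
alpha = 1.0 / math.sqrt(3.0)

# Gradient at the intermediate point x + alpha * p.
g = grad_f(x[0] + alpha * p[0], x[1] + alpha * p[1])

lhs = f(x[0] + p[0], x[1] + p[1])           # f(x + p)
rhs = f(*x) + g[0] * p[0] + g[1] * p[1]     # f(x) + grad^T p
print(lhs, rhs)
```

Both sides agree up to rounding, confirming that an $\alpha$ in the open interval $(0, 1)$ satisfies (A.55) for this $x$ and $p$.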

The extension of (A.55) to a vector-valued function $r : \mathbb{R}^n \to \mathbb{R}^m$ for $m > 1$ is not immediate. There is in general no scalar $\alpha$ such that the natural extension of (A.55) is satisfied. However, the following result is often a useful analog. As in (10.3), we denote the Jacobian of $r(x)$ by $J(x)$, where $J(x)$ is the $m \times n$ matrix whose $(j, i)$ entry is $\partial r_j / \partial x_i$, for $j = 1, 2, \ldots, m$ and $i = 1, 2, \ldots, n$, and assume that $J(x)$ is defined and continuous on the domain of interest. Given $x$ and $p$, we then have

$$r(x + p) - r(x) = \int_0^1 J(x + \alpha p) p \, d\alpha. \tag{A.57}$$

When $p$ is sufficiently small in norm, we can approximate the right-hand side of this expression adequately by $J(x)p$, that is,

$$r(x + p) - r(x) \approx J(x)p.$$

If $J$ is Lipschitz continuous in the vicinity of $x$ and $x + p$ with Lipschitz constant $L$, we can use (A.12) to estimate the error in this approximation as follows:

$$\begin{aligned} \|r(x + p) - r(x) - J(x)p\| &= \left\| \int_0^1 [J(x + \alpha p) - J(x)] p \, d\alpha \right\| \\ &\le \int_0^1 \|J(x + \alpha p) - J(x)\| \, \|p\| \, d\alpha \\ &\le \int_0^1 L\alpha \|p\|^2 \, d\alpha = \tfrac{1}{2} L \|p\|^2. \end{aligned}$$
Jo 


IMPLICIT  FUNCTION  THEOREM 

The implicit function theorem lies behind a number of important results in local convergence theory of optimization algorithms and in the characterization of optimality (see Chapter 12). Our statement of this result is based on Lang [187, p. 131] and Bertsekas [19, Proposition A.25].

Theorem A.2 (Implicit Function Theorem).

Let $h : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ be a function such that

(i) $h(z^*, 0) = 0$ for some $z^* \in \mathbb{R}^n$,

(ii) the function $h(\cdot, \cdot)$ is continuously differentiable in some neighborhood of $(z^*, 0)$, and

(iii) $\nabla_z h(z, t)$ is nonsingular at the point $(z, t) = (z^*, 0)$.

Then there exist open sets $\mathcal{N}_z \subset \mathbb{R}^n$ and $\mathcal{N}_t \subset \mathbb{R}^m$ containing $z^*$ and 0, respectively, and a continuous function $z : \mathcal{N}_t \to \mathcal{N}_z$ such that $z^* = z(0)$ and $h(z(t), t) = 0$ for all $t \in \mathcal{N}_t$. Further, $z(t)$ is uniquely defined. Finally, if $h$ is $p$ times continuously differentiable with respect to both its arguments for some $p > 0$, then $z(t)$ is also $p$ times continuously differentiable with respect to $t$, and we have

$$\nabla z(t) = -\nabla_t h(z(t), t) [\nabla_z h(z(t), t)]^{-1}$$

for all $t \in \mathcal{N}_t$.

This theorem is frequently applied to parametrized systems of linear equations, in which $z$ is obtained as the solution of

$$M(t)z = g(t),$$

where $M(\cdot) \in \mathbb{R}^{n \times n}$ has $M(0)$ nonsingular, and $g(\cdot) \in \mathbb{R}^n$. To apply the theorem, we define

$$h(z, t) = M(t)z - g(t).$$

If $M(\cdot)$ and $g(\cdot)$ are continuously differentiable in some neighborhood of 0, the theorem implies that $z(t) = M(t)^{-1}g(t)$ is a continuous function of $t$ in some neighborhood of 0.
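As a worked instance of the derivative formula in Theorem A.2, consider the parametrized linear system above with a scalar parameter ($m = 1$, an assumption made here only to keep the notation light). Differentiating the identity $M(t)z(t) = g(t)$ with respect to $t$ gives

$$\begin{aligned} M'(t)\,z(t) + M(t)\,z'(t) &= g'(t), \\ z'(t) &= M(t)^{-1} \bigl[ g'(t) - M'(t)\,z(t) \bigr]. \end{aligned}$$

The same expression follows from the theorem: with $h(z, t) = M(t)z - g(t)$ and the convention that $\nabla$ denotes the transposed Jacobian, we have $\nabla_z h = M(t)^T$ and $\nabla_t h = [M'(t)z - g'(t)]^T$, so that $-\nabla_t h \, [\nabla_z h]^{-1} = [g'(t) - M'(t)z]^T M(t)^{-T}$, which is the transpose of $z'(t)$.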

ORDER  NOTATION 

In  much  of  our  analysis  we  are  concerned  with  how  the  members  of  a  sequence  behave 
eventually,  that  is,  when  we  get  far  enough  along  in  the  sequence.  For  instance,  we  might 
ask  whether  the  elements  of  the  sequence  are  bounded,  or  whether  they  are  similar  in  size 
to  the  elements  of  a  corresponding  sequence,  or  whether  they  are  decreasing  and,  if  so, 
how  rapidly.  Order  notation  is  useful  shorthand  to  use  when  questions  like  these  are  being 
examined.  It  saves  us  defining  many  constants  that  clutter  up  the  argument  and  the  analysis. 

We will use three varieties of order notation: $O(\cdot)$, $o(\cdot)$, and $\Omega(\cdot)$. Given two nonnegative infinite sequences of scalars $\{\eta_k\}$ and $\{\nu_k\}$, we write

$$\eta_k = O(\nu_k)$$

if there is a positive constant $C$ such that

$$|\eta_k| \le C|\nu_k|$$

for all $k$ sufficiently large. We write

$$\eta_k = o(\nu_k)$$

if the sequence of ratios $\{\eta_k / \nu_k\}$ approaches zero, that is,

$$\lim_{k \to \infty} \frac{\eta_k}{\nu_k} = 0.$$

Finally, we write

$$\eta_k = \Omega(\nu_k)$$

if there are two constants $C_0$ and $C_1$ with $0 < C_0 \le C_1 < \infty$ such that

$$C_0 |\nu_k| \le |\eta_k| \le C_1 |\nu_k|,$$

that is, the corresponding elements of both sequences stay in the same ballpark for all $k$. This definition is equivalent to saying that $\eta_k = O(\nu_k)$ and $\nu_k = O(\eta_k)$.

The same notation is often used in the context of quantities that depend continuously on each other as well. For instance, if $\eta(\cdot)$ is a function that maps $\mathbb{R}$ to $\mathbb{R}$, we write

$$\eta(\nu) = O(\nu)$$

if there is a constant $C$ such that $|\eta(\nu)| \le C|\nu|$ for all $\nu \in \mathbb{R}$. (Typically, we are interested only in values of $\nu$ that are either very large or very close to zero; this should be clear from the context.) Similarly, we use

$$\eta(\nu) = o(\nu) \tag{A.58}$$

to indicate that the ratio $\eta(\nu)/\nu$ approaches zero either as $\nu \to 0$ or $\nu \to \infty$. (Again, the precise meaning should be clear from the context.)

As a slight variant on the definitions above, we write

$$\eta_k = O(1)$$

to indicate that there is a constant $C$ such that $|\eta_k| \le C$ for all $k$, while

$$\eta_k = o(1)$$

indicates that $\lim_{k \to \infty} \eta_k = 0$. We sometimes use vector and matrix quantities as arguments, and in these cases the definitions above are intended to apply to the norms of these quantities. For instance, if $f : \mathbb{R}^n \to \mathbb{R}^n$, we write $f(x) = O(\|x\|)$ if there is a constant $C > 0$ such that $\|f(x)\| \le C\|x\|$ for all $x$ in the domain of $f$. Typically, as above, we are interested only in some subdomain of $f$, usually a small neighborhood of 0. As before, the precise meaning should be clear from the context.

ROOT-FINDING  FOR  SCALAR  EQUATIONS 

In Chapter 11 we discussed methods for finding solutions of nonlinear systems of
equations $F(x) = 0$, where $F : \mathbb{R}^n \to \mathbb{R}^n$. Here we discuss briefly the case of scalar
equations ($n = 1$), for which the algorithm is easy to illustrate. Scalar root-finding is needed
in the trust-region algorithms of Chapter 4, for instance. Of course, the general theorems of
Chapter 11 can be applied to derive rigorous convergence results for this special case.

The basic step of Newton’s method (Algorithm Newton of Chapter 11) in the scalar
case is simply

$$p_k = -F(x_k)/F'(x_k), \qquad x_{k+1} \leftarrow x_k + p_k \qquad \text{(A.59)}$$

(cf.  (11.6)).  Graphically,  such  a  step  involves  taking  the  tangent  to  the  graph  of  F  at  the 
point  xk  and  taking  the  next  iterate  to  be  the  intersection  of  this  tangent  with  the  x  axis 
(see  Figure  A.2).  Clearly,  if  the  function  F  is  nearly  linear,  the  tangent  will  be  quite  a  good 
approximation  to  F  itself,  so  the  Newton  iterate  will  be  quite  close  to  the  true  root  of  F. 
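
The Newton step (A.59) can be sketched in a few lines of Python. This is a minimal illustration without safeguards, not a robust root-finder: the function name `newton_scalar`, the stopping test on $|F(x_k)|$, the tolerance, the iteration cap, and the test problem $F(x) = x^2 - 2$ are all assumptions made for the example.

```python
def newton_scalar(F, dF, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k + p_k with p_k = -F(x_k) / F'(x_k).

    Returns the final iterate and the number of steps taken.
    The stopping rule |F(x_k)| <= tol is an illustrative choice.
    """
    x = x0
    for k in range(max_iter):
        Fx = F(x)
        if abs(Fx) <= tol:
            return x, k
        slope = dF(x)
        if slope == 0.0:
            # A horizontal tangent never intersects the x axis.
            raise ZeroDivisionError("F'(x_k) = 0; the Newton step is undefined")
        p = -Fx / slope   # Newton step p_k
        x = x + p         # x_{k+1} <- x_k + p_k
    return x, max_iter


# Example: the positive root of F(x) = x^2 - 2, starting from x_0 = 1.
root, iterations = newton_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, iterations)
```

Because this $F$ is smooth and the starting point is reasonably close to the root, only a handful of steps are needed; for a poor starting point or a nearly flat $F$ the unsafeguarded iteration can diverge.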


Figure  A.2  One  step  of  Newton’s  method  for  a  scalar  equation. 


Figure  A.3  One  step  of  the  secant  method  for  a  scalar  equation. 


The  secant  method  for  scalar  equations  can  be  viewed  as  the  specialization  of  Broyden’s 
method to the case of $n = 1$. The issues are simpler in this case, however, since the secant
equation (11.27) completely determines the value of the $1 \times 1$ approximate Hessian $B_k$.
That  is,  we  do  not  need  to  apply  extra  conditions  to  ensure  that  Bk  is  fully  determined.  By 
combining  (11.24)  with  (11.27),  we  find  that  the  secant  method  for  the  case  of  n  =  1  is 
defined  by 


$$B_k = \frac{F(x_k) - F(x_{k-1})}{x_k - x_{k-1}}, \qquad \text{(A.60a)}$$

$$p_k = -F(x_k)/B_k, \qquad x_{k+1} = x_k + p_k. \qquad \text{(A.60b)}$$

By illustrating this algorithm, we see the origin of the term “secant.” $B_k$ approximates the
slope of the function at $x_k$ by taking the secant through the points $(x_{k-1}, F(x_{k-1}))$ and
$(x_k, F(x_k))$, and $x_{k+1}$ is obtained by finding the intersection of this secant with the $x$ axis.
The  method  is  illustrated  in  Figure  A.3. 
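
The secant iteration (A.60) differs from Newton's method only in replacing the derivative by the slope $B_k$ through the two most recent points. The Python sketch below shows this under stated assumptions: the name `secant_scalar`, the stopping test, the tolerance, and the test problem are illustrative choices rather than anything prescribed by the text.

```python
def secant_scalar(F, x_prev, x, tol=1e-12, max_iter=100):
    """Secant method for a scalar equation F(x) = 0.

    x_prev and x are the two starting points x_{k-1} and x_k.
    Returns the final iterate and the number of steps taken.
    """
    F_prev, Fx = F(x_prev), F(x)
    for k in range(max_iter):
        if abs(Fx) <= tol:
            return x, k
        if x == x_prev or Fx == F_prev:
            # The secant is undefined or horizontal; (A.60) cannot be applied.
            raise ZeroDivisionError("secant slope B_k is undefined or zero")
        B = (Fx - F_prev) / (x - x_prev)   # (A.60a): slope of the secant
        p = -Fx / B                        # (A.60b): step to the x-axis crossing
        x_prev, F_prev = x, Fx
        x = x + p
        Fx = F(x)
    return x, max_iter


# Example: the positive root of F(x) = x^2 - 2 from the starting pair (1, 2).
root, iterations = secant_scalar(lambda x: x * x - 2.0, 1.0, 2.0)
print(root, iterations)
```

Each step costs one new evaluation of $F$ and no derivative evaluations, which is the practical attraction of the method when $F'$ is unavailable or expensive.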


Appendix B

A Regularization Procedure


The following algorithm chooses parameters $\delta$, $\gamma$ that guarantee that the regularized primal-
dual matrix (19.25) is nonsingular and satisfies the inertia condition (19.24). The algorithm
assumes that, at the beginning of the interior-point iteration, $\delta_{\rm old}$ has been initialized to
zero.


Algorithm B.1 (Inertia Correction and Regularization).

  Given the current barrier parameter $\mu$, constants $\eta > 0$ and $\beta < 1$, and the
  perturbation $\delta_{\rm old}$ used in the previous interior-point iteration.

  Factor (19.25) with $\delta = \gamma = 0$.
  if (19.25) is nonsingular and its inertia is $(n + m, l + m, 0)$
      compute the primal-dual step; stop;
  if (19.25) has zero eigenvalues
      set $\gamma \leftarrow 10^{-8} \eta \mu^{\beta}$;
  if $\delta_{\rm old} = 0$
      set $\delta \leftarrow 10^{-4}$;
  else
      set $\delta \leftarrow \delta_{\rm old}/2$;
  repeat
      Factor the modified matrix (19.25);
      if the inertia is $(n + m, l + m, 0)$
          Set $\delta_{\rm old} \leftarrow \delta$;
          Compute the primal-dual step (19.12) using the coefficient
          matrix (19.25); stop;
      else
          Set $\delta \leftarrow 10\delta$;
  end (repeat)

This algorithm has been adapted from a more elaborate procedure described by
Wächter and Biegler [301]. All constants used in the algorithm are arbitrary; we have
provided typical choices. The algorithm aims to avoid unnecessarily large modifications $\delta I$ of
$\nabla^2_{xx} \mathcal{L}$ while trying to minimize the number of matrix factorizations. Excessive modifications
degrade the performance of the algorithm because they erase the second derivative
information contained in $\nabla^2_{xx} \mathcal{L}$, and cause the step to take on steepest-descent-like characteristics.
The first trial value ($\delta = \delta_{\rm old}/2$) is based on the previous modification $\delta_{\rm old}$ because the
minimum perturbation $\delta$ required to achieve the desired inertia will often not vary much
from one interior-point iteration to the next.
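
The control flow of Algorithm B.1 can be sketched in Python on a toy problem. Several simplifications are assumed here and are not taken from the text: only equality constraints are present (no slack block, so $m = 0$), the regularized matrix is assumed to have the form $[[H + \delta I, A^T], [A, -\gamma I]]$ with target inertia $(n, l, 0)$, and the inertia is obtained by counting eigenvalue signs with a small Jacobi eigenvalue routine rather than from a symmetric indefinite factorization as a practical code would do. All function names, the defaults `eta`, `beta`, `zero_tol`, and the cap `max_tries` are hypothetical.

```python
import math


def symmetric_eigenvalues(M, sweeps=100, tol=1e-24):
    """Eigenvalues of a small symmetric matrix by the cyclic Jacobi method."""
    A = [row[:] for row in M]
    n = len(A)
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [A[i][i] for i in range(n)]


def inertia(M, zero_tol=1e-10):
    """Return (number of positive, negative, zero eigenvalues) of M."""
    eig = symmetric_eigenvalues(M)
    pos = sum(1 for e in eig if e > zero_tol)
    neg = sum(1 for e in eig if e < -zero_tol)
    return pos, neg, len(eig) - pos - neg


def regularized_kkt(H, A, delta, gamma):
    """Assemble [[H + delta*I, A^T], [A, -gamma*I]] (assumed simplified form)."""
    n, l = len(H), len(A)
    K = [[0.0] * (n + l) for _ in range(n + l)]
    for i in range(n):
        for j in range(n):
            K[i][j] = H[i][j] + (delta if i == j else 0.0)
    for r in range(l):
        for j in range(n):
            K[n + r][j] = K[j][n + r] = A[r][j]
        K[n + r][n + r] = -gamma
    return K


def inertia_correction(H, A, mu, delta_old, eta=1.0, beta=0.5, max_tries=40):
    """Follow the steps of Algorithm B.1; returns (delta, gamma, new delta_old)."""
    target = (len(H), len(A), 0)
    pos, neg, zero = inertia(regularized_kkt(H, A, 0.0, 0.0))
    if (pos, neg, zero) == target:
        return 0.0, 0.0, delta_old            # no modification needed
    gamma = 1e-8 * eta * mu ** beta if zero > 0 else 0.0
    delta = 1e-4 if delta_old == 0.0 else delta_old / 2.0
    for _ in range(max_tries):                # "repeat" loop, capped for safety
        if inertia(regularized_kkt(H, A, delta, gamma)) == target:
            return delta, gamma, delta        # delta_old <- delta
        delta *= 10.0
    raise RuntimeError("no acceptable perturbation found")


# Toy data: H is indefinite on the null space of A, so a shift delta > 0.5 is needed.
H = [[1.0, 0.0], [0.0, -0.5]]
A = [[1.0, 0.0]]
delta, gamma, delta_old = inertia_correction(H, A, mu=0.1, delta_old=0.0)
print(delta, gamma, delta_old)
```

On this toy problem the trial values $10^{-4}, 10^{-3}, \ldots$ are rejected until $\delta$ reaches roughly 1, which illustrates why the algorithm reuses $\delta_{\rm old}/2$ as the first trial at the next iteration: it avoids repeating the whole sequence of failed factorizations.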

The heuristics implemented in Algorithm B.1 provide an alternative to those employed
in  Algorithm  7.3,  which  were  presented  in  the  context  of  unconstrained  optimization.  We 
emphasize,  however,  that  all  of  these  are  indeed  heuristics  and  may  not  always  provide 
adequate safeguards.

Index

Accumulation point, see Limit point
Active  set,  308,  323,  336,  342 
Affine  scaling 

direction,  395,  398,  414 
method,  417 

Alternating  variables  method,  see  also 
Coordinate  search  method,  104, 
230 

Angle  test,  41 
Applications 

design  optimization,  1 
finance,  7 

portfolio  optimization,  1,  449-450,  492 
transportation,  4 

Armijo  line  search,  see  Line  search,  Armijo 
Augmented  Lagrangian  function,  423 
as  merit  function,  436 
definition,  514 


exactness  of,  517-518 
example,  516 

Augmented  Lagrangian  method,  422,  498, 
514-526 

convergence,  518-519 
framework  for,  515 
implementation,  519-523 
LANCELOT,  175,  519-522 
motivation,  514-515 
Automatic  differentiation,  170,  194 
adjoint  variables,  208,  209 
and  graph-coloring  algorithms,  212, 
216-218 

checkpointing,  210 
common  expressions,  211 
computational  graph,  205-206,  208, 
210, 211, 213, 215



computational  requirements,  206-207, 
210, 214, 216, 219
forward  mode,  206-207,  278 
forward  sweep,  206,  208,  210,  213-215, 
219 

foundations  in  elementary  arithmetic, 
194,  204 

Hessian  calculation 

forward  mode,  213-215 
interpolation  formulae,  214-215 
reverse  mode,  215-216 
intermediate  variables,  205-209,  211, 
212, 218

Jacobian  calculation,  210-213 
forward  mode,  212 
reverse  mode,  212-213 
limitations  of,  216-217 
reverse  mode,  207-210 
reverse  sweep,  208-210,  218 
seed  vectors,  206,  207,  212,  213,  216 
software,  194,  210,  217 

Backtracking,  37,  240 
Barrier  functions,  566,  583 
Barrier  method,  563-566 
primal,  583 
Basic  variables,  429 
Basis  matrix,  429-431 
BFGS  method,  24,  29,  136-143 
damping,  537 
implementation,  142-143 
properties,  141-142,  161 
self-correction,  142 
skipping,  143,  537 

Bound-constrained  optimization,  97, 
485-490 
BQPD,  490 

Broyden  class,  see  Quasi-Newton  method, 
Broyden  class 

Broyden’s  method,  273,  274,  284,  285,  302, 
634 

derivation  of,  279-281 
limited-memory  variants,  283 
rate  of  convergence,  281-283 
statement  of  algorithm,  281 
Byrd-Omojokun  method,  547,  579 


Calculus  of  variations,  9 
Cancellation  error,  see  Floating-point 
arithmetic,  cancellation 
Cauchy  point,  71-73,  76,  77,  93,  100,  170, 
172, 262, 486
calculation  of,  71-72,  96 
for  nonlinear  equations,  291-292 
role  in  global  convergence,  77-79 
Cauchy  sequence,  618 
Cauchy-Schwarz  inequality,  75,  99,  151, 
600 

Central  path,  397-399,  417 

for nonlinear problems, 565, 584, 594

neighborhoods  of,  399-401,  403,  406, 
413 

Chain  rule,  29,  194,  204,  206-208,  213, 
625,  627,  629 

Cholesky factorization, 87, 141, 143,
161, 251, 259, 289, 292, 454, 599,
608-609, 617
incomplete,  174 
modified,  48,  51-54,  63,  64,  76 
bounded  modified  factorization 
property,  48 
sparse,  412-413 
stability  of,  53,  617 
Classification  of  algorithms,  422 
Combinatorial  difficulty,  424 
Complementarity  condition,  70,  313,  321, 
333,  397 

strict,  321,  337,  342,  533,  565,  591 
Complementarity  problems 
linear  (LCP),  415 
nonlinear  (NCP),  417 
Complexity  of  algorithms,  388-389,  393, 
406, 415, 417

Conditioning,  see  also  Matrix,  condition 
number,  426,  430-432,  616-617 
ill  conditioned,  29,  502,  514,  586,  616 
well  conditioned,  616 
Cone,  621 

Cone  of  feasible  directions,  see  Tangent 
cone 

Conjugacy,  25,  102 
Conjugate direction method, 103
expanding subspace minimization, 106,
172, 173

termination  of,  103 

Conjugate  gradient  method,  71,  101-132, 
166, 170-173, 253, 278
n-step quadratic convergence, 133
clustering  of  eigenvalues,  116 
effect  of  condition  number,  117 
expanding  subspace  minimization,  112 
Fletcher-Reeves,  see  Fletcher-Reeves 
method 

for  reduced  system,  459-461 
global  convergence,  40 
Hestenes-Stiefel,  123 
Krylov  subspace,  113 
modified  for  indefiniteness,  169-170 
nonlinear,  25,  121-131 
numerical  performance,  131 
optimal  polynomial,  113 
optimal  process,  112 
Polak-Ribière, see Polak-Ribière
method

practical  version,  111 
preconditioned, 118-119, 170, 460
projected,  461-463,  548,  571,  581,  593 
rate  of  convergence,  112 
relation  to  limited-memory,  180 
restarts,  124 

superlinear  convergence,  132 
superquadratic,  133 
termination,  115,  124 

Constrained  optimization,  6 

nonlinear,  4,  6,  211,  293,  356,  421,  498, 
500 

Constraint qualifications, 315-320, 333,
338-340, 350

linear  independence  (LICQ),  320,  321, 
323,  339,  341,  358,  464,  503,  517, 
533,  557,  565,  591 
Mangasarian-Fromovitz (MFCQ),
339-340

Constraints,  2,  307 
bounds,  434,  519,  520 
equality,  305 
inequality,  305 

Continuation  methods  for  nonlinear 
equations,  274,  303 


application  to  KKT  conditions  for 
nonlinear  optimization,  565 
convergence  of,  300-301 
formulation  as  initial-value  ODE, 
297-299 

motivation,  296-297 
predictor-corrector  method,  299-300 
zero  path,  296-301,  303 
divergence  of,  300-301 
tangent,  297-300 
turning  point,  296,  297,  300 
Convergence,  rate  of,  619-620 
n-step quadratic, 133
linear, 262, 619, 620
quadratic,  23,  29,  49,  168,  257,  619,  620 
sublinear,  29 

superlinear,  23,  29,  73,  132,  140,  142, 
160,  161,  168,  262-265,  414,  619, 
620 

superquadratic,  133 
Convex  combination,  621 
Convex  hull,  621 
Convex  programming,  7,  8,  335 
Convexity,  7-8 

of  functions,  8,  16-17,  28,  250 
of  sets,  8,  28,  352 
strict,  8 

Coordinate  descent  method,  see 

Alternating  variables  method,  233 
Coordinate  relaxation  step,  431 
Coordinate  search  method,  135,  230-231 
CPLEX,  490 
Critical  cone,  330 

Data-fitting  problems,  11-12,  248 
Degeneracy,  465 

of  basis,  366,  369,  372,  382 
of  linear  program,  366 
Dennis and Moré characterization, 47
Descent  direction,  21,  29,  30 
DFP  method,  139 
Differential  equations 
ordinary,  299 
partial,  216,  302 
Direct  sum,  603 

Directional  derivative,  206,  207,  437, 
628-629 

Discrete optimization, 5-6
Dual slack variables, 359
Dual  variables,  see  also  Lagrange 
multipliers,  359 
Duality,  350 

in  linear  programming,  359-362 
in  nonlinear  programming,  343-349 
weak,  345,  361 

Eigenvalues,  84,  252,  337,  599,  603,  613 
negative,  77,  92 
of  symmetric  matrix,  604 
Eigenvectors,  84,  252,  603 
Element  function,  186 
Elimination  of  variables,  424 

linear  equality  constraints,  428-433 
nonlinear,  426-428 

when  inequality  constraints  are  present, 
434 

Ellipsoid  algorithm,  389,  393,  417 
Error 

absolute,  614 

relative,  196,  251,  252,  614,  617 
truncation,  216 

Errors-in-variables  models,  265 

Feasibility  restoration,  439-440 
Feasible  sequences,  316-325,  332-333,  336 
limiting  directions  of,  316-325,  329,  333 
Feasible  set,  3,  305,  306,  338 

geometric  properties  of,  340-341 
primal,  358 

primal-dual,  397,  399,  405,  414 
Filter  method,  437-440 
Filters,  424,  437-440,  575,  589 
for  interior-point  methods,  575 
Finite  differencing,  170, 193-204,  216,  268, 
278 

and  graph-coloring  algorithms,  202-204 
and  noise,  221 

central-difference formula, 194,
196-197, 202, 217

forward-difference formula, 195, 196,
202, 217

gradient  approximation,  195-197 
graph-coloring  algorithms  and,  200-201 
Hessian  approximation,  201-204 
Jacobian  approximation,  197-201,  283 


First-order  feasible  descent  direction, 
310-315 

First-order  optimality  conditions,  see 

also  Karush-Kuhn-Tucker  (KKT) 
conditions,  90,  275,  307-329,  340, 
352 

derivation  of,  315-329 
examples,  308-315,  317-319,  321-322 
fundamental  principle  of,  325-326 
unconstrained  optimization,  14-15,  513 
Fixed-regressor  model,  248 
Fletcher-Reeves  method,  102,  121-131 
convergence  of,  125 
numerical  performance,  131 
Floating-point  arithmetic,  216,  614-615, 
617 

cancellation,  431,  615 
double-precision,  614 
roundoff  error,  195,  217,  251,  615 
unit  roundoff,  196,  217,  614 
Floating-point  numbers,  614 
exponent,  614 
fractional  part,  614 

Forcing  sequence,  see  Newton’s  method, 
inexact,  forcing  sequence 
Function 

continuous,  623-624 
continuously  differentiable,  626,  631 
derivatives  of,  625-630 
differentiable,  626 
Lipschitz  continuous,  624,  630 
locally  Lipschitz  continuous,  624 
one-sided  limit,  624 
univariate,  625 
Functions 

smooth,  10,  14,  306-307,  330 
Fundamental  theorem  of  algebra,  603 

Gauss-Newton  method,  254-258,  263, 
266,  275 

connection  to  linear  least  squares,  255 
line  search  in,  254 
performance  on  large-residual 
problems,  262 

Gaussian  elimination,  51,  430,  455,  609 
sparse,  430,  433 
stability  of,  617 

with row partial pivoting, 607, 617
Global convergence, 77-92, 261, 274
Global  minimizer,  12-13,  16,  17,  502,  503 
Global  optimization,  6-8,  422 
Global  solution,  see  also  Global  minimizer, 
6,  69-70,  89-91,  305,  335,  352 
GMRES,  278,  459,  492,  571 
Goldstein  condition,  36,  48 
Gradient,  625 
generalized,  18 

Gradient  projection  method,  464, 
485-490,  492,  521 

Group  partial  separability,  see  Partially 

separable  function,  group  partially 
separable 

Hessian,  14,  19,  20,  23,  26,  626 
average,  138,  140 
Homotopy  map,  296 
Homotopy  methods,  see  Continuation 
methods  for  nonlinear  equations 

Implicit  filtering,  240-242 
Implicit function theorem, 324,
630-631

Inexact  Newton  method,  see  Newton’s 
method,  inexact 
Infeasibility  measure,  437 
Inner  product,  599 
Integer  programming,  5, 416 
branch-and-bound  algorithm,  6 
Integral  equations,  302 
Interior-point  methods,  see  Primal-dual 
interior-point  methods 
nonlinear,  see  Nonlinear  interior-point 
method 

Interlacing  eigenvalue  theorem,  613 
Interpolation  conditions,  223 
Invariant  subspace,  see  Partially  separable 
optimization,  invariant  subspace 
Iterative  refinement,  463 

Jacobian,  246,  254,  256,  269,  274,  324,  395, 
504,  627,  630 

Karmarkar’s  algorithm,  389,  393,  417 
Karush-Kuhn-Tucker  (KKT)  conditions, 
330,  332,  333,  335-337,  339,  350, 
354,  503,  517,  520,  528 


for general constrained problem, 321
for  linear  programming,  358-360,  367, 
368,  394-415 

for  linear  programming,  394 
KNITRO,  490,  525,  583,  592 
Krylov  subspace,  108 
method,  459 

L-BFGS  algorithm,  177-180,  183 
Lagrange  multipliers,  310,  330,  333,  337, 
339,  341-343,  353,  358,  360,  419, 
422 

estimates  of,  503,  514,  515,  518,  521, 
522,  584 

Lagrangian  function,  90,  310,  313,  320, 
329,  330,  336 

for  linear  program,  358,  360 
Hessian  of,  330,  332,  333,  335,  337,  358 
LANCELOT,  520,  525,  592 
Lanczos  method,  77,  166,  175-176 
LAPACK,  607 

Least-squares  multipliers,  581 
Least-squares  problems,  linear,  250-254 
normal  equations,  250-251,  255,  259, 
412 

sensitivity  of  solutions,  252 
solution  via  QR  factorization,  251-252 
solution  via  SVD,  252-253 
Least-squares  problems,  nonlinear,  12,  210 
applications  of,  246-248 
Dennis-Gay-Welsch  algorithm, 
263-265 

Fletcher-Xu  algorithm,  263 
large-residual  problems,  262-265 
large-scale  problems,  257 
scaling  of,  260-261 
software  for,  263,  268 
statistical  justification  of,  249-250 
structure,  247,  254 
Least-squares  problems,  total,  265 
Level  set,  92,  261 

Levenberg-Marquardt  method,  258-262, 
266,  289 

as  trust-region  method,  258-259,  292 
for  nonlinear  equations,  292 
implementation  via  orthogonal 
transformations,  259-260 
inexact, 268
local  convergence  of,  262 
performance  on  large-residual 
problems,  262 
lim  inf,  lim  sup,  618-619 
Limit  point,  28,  79,  92,  99,  502,  503,  618, 
620 

Limited-memory  method,  25,  176-185, 
190 

compact  representation,  181-184 
for  interior-point  method,  575,  597 
L-BFGS,  176-180,  538 
memoryless  BFGS  method,  180 
performance  of,  179 
relation  to  CG,  180 
scaling,  178 
SR1, 183

two-loop  recursion,  178 
Line  search,  see  also  Step  length  selection 
Armijo,  33,  48,  240 
backtracking,  37 
curvature  condition,  33 
Goldstein,  36 
inexact, 31

Newton’s  method  with,  22-23 
quasi-Newton  methods  with,  23-25 
search  directions,  20-25 
strong  Wolfe  conditions,  see  Wolfe 
conditions,  strong 
sufficient  decrease,  33 
Wolfe  conditions,  see  Wolfe  conditions 
Line  search  method,  19-20,  30-48,  66,  67, 
71, 230-231, 247

for  nonlinear  equations,  271,  285, 
287-290 

global  convergence  of,  287-288 
poor  performance  of,  288-289 
Linear  programming,  4,  6,  7,  9,  293 
artificial  variables,  362,  378-380 
basic  feasible  points,  362-366 
basis  B,  362-368,  378 
basis  matrix,  363 
dual  problem,  359-362 
feasible  polytope,  356 
vertices  of,  365-366 
fundamental  theorem  of,  363-364 
infeasible,  356,  357 


nonbasic  matrix,  367 
primal  solution  set,  356 
slack  and  surplus  variables,  356,  357, 
362,  379,  380 
splitting  variables,  357 
standard  form,  356-357 
unbounded,  356,  357,  369 
warm  start,  410,  416 

Linearly  constrained  Lagrangian  methods, 
522-523,  527 
MINOS,  523,  527 
Linearly  dependent,  337 
Linearly  independent,  339,  503,  504,  517, 
519,  602 

Lipschitz  continuity,  see  also  Function, 
Lipschitz  continuous,  80,  93,  256, 
257,  261,  269,  276-278,  287,  294 
Local  minimizer,  12,  14,  273 
isolated,  13,  28 
strict,  13,  14,  16,  28,  517 
weak,  12 

Local  solution,  see  also  Local  minimizer,  6, 
305-306,  316,  325,  329,  332,  340, 
342,  352,  513 
isolated,  306 
strict,  306,  333,  335,  336 
strong,  306 

Log-barrier  function,  417,  597 
definition,  583-584 
difficulty  of  minimizing,  584-585 
example,  586 

ill  conditioned  Hessian  of,  586 
Log-barrier  method,  498,  584 
LOQO,  490,  592 

LSQR  method,  254,  268,  459,  492,  571 
LU  factorization,  606-608 

Maratos  effect,  440-446,  543,  550 
example  of,  440,  543 
remedies,  442 
Matlab,  416 
Matrix 

condition  number,  251,  601-602,  604, 
610, 616

determinant,  154,  605-606 
diagonal,  252,  412,  429,  599 
full-rank,  298,  300,  504,  609 
identity, 599
indefinite, 76
inertia,  55,  454 

lower  triangular,  599,  606,  607 
modification,  574 
nonsingular,  325,  337,  601,  612 
null  space,  298,  324,  337,  430,  432,  603, 
608,  609 

orthogonal,  251,  252,  337, 432,  599,  604, 
609 

permutation,  251,  429,  606 
positive  definite,  15,  16,  23,  28,  68,  76, 
337,  599,  603,  609 

positive  semidefinite,  8,  15,  70,  415,  599 

projection,  462 

range  space,  430,  603 

rank-deficient,  253 

rank-one,  24 

rank-two,  24 

singular,  337 

sparse,  411,  413,  607 

Cholesky  factorization,  413 
symmetric,  24,  68,  412,  599,  603 
symmetric  indefinite,  413 
symmetric  positive  definite,  608 
trace,  154,  605 
transpose,  599 

upper  triangular,  251,  337,  599,  606, 
607,  609 

Maximum  likelihood  estimate,  249 
Mean  value  theorem,  629-630 
Merit  function,  see  also  Penalty  function, 
435-437,  446 

$\ell_1$, 293, 435-436, 513, 540-543, 550
choice  of  parameter,  543 
exact,  435-436 
definition  of,  435 
nonsmoothness  of,  513 
Fletcher’s  augmented  Lagrangian,  436, 
540 

for  feasible  methods,  435 
for  nonlinear  equations,  273,  285-287, 
289,  290,  293,  296,  301-303,  505 
for  SQP,  540-543 
Merit  functions,  424,  575 
Method  of  multipliers,  see  Augmented 
Lagrangian  method 


MINOS,  see  also  Linearly  constrained 

Lagrangian  methods,  523,  525,  592 
Model-based  methods  for  derivative-free 
optimization,  223-229 
minimum  Frobenius  change,  228 
Modeling,  2,  9,  11,  247-249 
Monomial  basis,  227 
MOSEK,  490 

Multiobjective  optimization,  437 

Negative  curvature  direction,  49,  50,  63, 
76, 169-172, 175, 489, 491
Neighborhood,  13,  14,  28,  256,  621 
Network  optimization,  358 
Newton’s  method,  25,  247,  254,  257,  263 
for  log-barrier  function,  585 
for  nonlinear  equations,  271,  274-277, 
281,  283,  285,  287-290,  294,  296, 
299,  302 
cycling,  285 
inexact,  277-279,  288 
for  quadratic  penalty  function,  501,  506 
global  convergence,  40 
Hessian-free,  165,  170 
in  one  variable,  84-87,  91,  633 
inexact,  165-168,  171,  213 

forcing  sequence,  166-169,  171,  277 
large  scale 

LANCELOT,  175 
line  search  method,  49 
TRON,  175 
modified,  48-49 

adding  a  multiple  of  I,  51 
eigenvalue  modification,  49-51 
Newton-CG,  202 
line  search,  168-170 
preconditioned,  174-175 
trust-region,  170-175 
Newton-Lanczos,  175-176,  190 
rate  of  convergence,  44,  76,  92,  166-168, 
275-277,  281-282,  620 
scale  invariance,  27 
Noise  in  function  evaluation,  221-222 
Nondifferentiable  optimization,  511 
Nonlinear  equations,  197,  210,  213,  633 
degenerate  solution,  274,  275,  283,  302 
examples  of,  271-272,  288-289, 
300-301

merit  function,  see  Merit  function,  for 
nonlinear  equations 
multiple  solutions,  273-274 
nondegenerate  solution,  274 
quasi-Newton  methods,  see  Broyden’s 
method 

relationship  to  least  squares,  271-272, 
275,  292-293,  302 
relationship  to  optimization,  271 
relationship  to  primal-dual 

interior-point  methods,  395 
solution,  271 

statement  of  problem,  270-271 
Nonlinear  interior-point  method,  423, 
563-593 

barrier  formulation,  565 
feasible  version,  576 
global  convergence,  589 
homotopy  formulation,  565 
superlinear  convergence,  591 
trust-region  approach,  578 
Nonlinear  least-squares,  see  Least-squares 
problems,  nonlinear 

Nonlinear  programming,  see  Constrained 
optimization,  nonlinear 
Nonmonotone  strategy,  18,  444-446 
relaxed  steps,  444 
Nonnegative  orthant,  97 
Nonsmooth  functions,  6,  17-18,  306,  307, 
352 

Nonsmooth  penalty  function,  see  Penalty 
function,  nonsmooth 

Norm 
dual,  601 

Euclidean,  25,  51,  251,  280,  302,  600, 
601, 605, 610

Frobenius,  50,  138,  140,  601 
matrix,  601-602 
vector,  600-601 
Normal  cone,  340-341 
Normal  distribution,  249 
Normal  subproblem,  580 
Null  space,  see  Matrix,  null  space 
Numerical  analysis,  355 

Objective  function,  2,  10,  304 
One-dimensional  minimization,  19,  56 


OOPS,  490 
OOQP,  490 

Optimality conditions, see also First-order
optimality conditions, Second-order
optimality conditions, 2, 9,
305

for  unconstrained  local  minimizer, 
14-17 

Order  notation,  631-633 
Orthogonal  distance  regression,  265-267 
contrast  with  least  squares,  265-266 
structure,  266-267 

Orthogonal  transformations,  251, 259-260 
Givens,  259,  609 
Householder,  259,  609 

Partially  separable  function,  25,  186-189, 
211 

automatic  detection,  211 
definition,  211 

Partially  separable  optimization,  165 
BFGS,  189 

compactifying  matrix,  188 
element  variables,  187 
quasi-Newton  method,  188 
SR1, 189

Penalty  function,  see  also  Merit  function, 
498 

$\ell_1$, 507-513

exact,  422-423,  507-513 
nonsmooth,  497,  507-513 
quadratic,  see  also  Quadratic  penalty 
method,  422,  498-507,  525-527, 
586 

difficulty  of  minimizing,  501-502 
Hessian  of,  505-506 
relationship  to  augmented 
Lagrangian,  514 
unbounded,  500 

Penalty  parameter,  435,  436,  498,  500,  501, 
507,  514,  521,  525 
update,  511,  512 
PENNON,  526 
Pivoting,  251,  617 
Polak-Ribière method, 122
convergence of, 130
numerical performance, 131
Polynomial bases, 226
monomials,  227 

Portfolio  optimization,  see  Applications, 
portfolio  optimization 
Preconditioners,  118-120 
banded,  120 
constraint,  463 

for  constrained  problems,  462 
for  primal-dual  system,  571 
for  reduced  system,  460 
incomplete  Cholesky,  120 
SSOR,  120 

Preprocessing,  see  Presolving 
Presolving,  385-388 
Primal  interior-point  method,  570 
Primal-dual  interior-point  methods,  389, 
597 

centering  parameter,  396,  398,  401,  413 
complexity  of,  393,  406,  415 
contrasts  with  simplex  method,  356, 
393 

convex  quadratic  programs,  415 
corrector  step,  414 
duality  measure,  395,  398 
infeasibility  detection,  411 
linear algebra issues, 411-413
Mehrotra’s  predictor-corrector 
algorithm,  393,  407-411 
path-following  algorithms,  399-414 
long-step,  399-406 
predictor-corrector (Mizuno-Todd-Ye) algorithm, 413
short-step, 413
potential  function,  414 
Tanabe-Todd-Ye,  414 
potential-reduction  algorithms,  414 
predictor  step,  413 
quadratic  programming,  480-485 
relationship  to  Newton’s  method,  394, 
395 

starting point, 410-411
Primal-dual  system,  567 
Probability  density  function,  249 
Projected  conjugate  gradient  method, 
see  Conjugate  gradient  method, 
projected 


Projected  Hessian,  558 
two-sided,  559 
Proximal  point  method,  523 

QMR  method,  459,  492,  571 
QPA,  490 
QPOPT,  490 

QR  factorization,  251,  259,  290,  292,  298, 
337,  432,  433,  609-610 
cost  of,  609 

relationship  to  Cholesky  factorization, 
610 

Quadratic  penalty  method,  see  also  Penalty 
function,  quadratic,  497,  501-502, 
514 

convergence  of,  502-507 
Quadratic  programming,  422,  448-492 
active-set  methods,  467-480 
big  M  method,  473 
blocking  constraint,  469 
convex,  449 
cycling,  477 
duality,  349,  490 
indefinite,  449,  467,  491-492 
inertia  controlling  methods,  491,  492 
initial  working  set,  476 
interior-point  method,  480-485 
nonconvex,  see  Quadratic  programming, 
indefinite 

null-space  method,  457-459 
optimal  active  set,  467 
optimality  conditions,  464 
phase  I,  473 

Schur-complement  method,  455-456 
software,  490 

strictly convex, 349, 449, 472,
477-478

termination,  477-478 
updating  factorizations,  478 
working  set,  468-478 
Quasi-Newton  approximate  Hessian,  23, 
24,  73,  242,  634 

Quasi-Newton  method,  25,  165,  247,  263, 
501, 585

BFGS,  see  BFGS  method,  263 
bounded  deterioration,  161 
Broyden  class,  149-152 
curvature condition, 137
DFP, see DFP method, 190, 264
for  interior-point  method,  575 
for  nonlinear  equations,  see  Broyden’s 
method 

for  partially  separable  functions,  25 
global  convergence,  40 
large-scale,  165-189 
limited  memory,  see  Limited  memory 
method 

rate  of  convergence,  46,  620 
secant  equation,  24,  137,  139,  263-264, 
280,  634 

sparse,  see  Sparse  quasi-Newton  method 

Range  space,  see  Matrix,  range  space 
Regularization,  574 
Residuals,  11,  245,  262-265,  269 
preconditioned,  462 
vector  of,  18,  197,  246 
Restoration  phase,  439 
Robust  optimization,  7 
Root,  see  Nonlinear  equations,  solution 
Rootfinding  algorithm,  see  also  Newton’s 
method,  in  one  variable,  259,  260, 
633 

for  trust-region  subproblem,  84-87 
Rosenbrock  function 
extended,  191 

Roundoff  error,  see  Floating-point 
arithmetic,  roundoff  error 
Row  echelon  form,  430 

S$\ell_1$QP method, 293, 549
Saddle  point,  28,  92 
Scale  invariance,  27,  138,  141 

of  Newton’s  method,  see  Newton’s 
method,  scale  invariance 
Scaling,  26-27,  95-97,  342-343,  585 
example  of  poor  scaling,  26-27 
matrix,  96 

Schur  complement,  456,  611 
Secant  method,  see  also  Quasi-Newton 
method,  280,  633,  634 
Second-order  correction,  442-444,  550 
Second-order  optimality  conditions, 
330-337,  342,  602 

for  unconstrained  optimization,  15-16 


necessary,  92,  331 
sufficient,  333-336,  517,  557 
Semidefinite  programming,  415 
Sensitivity,  252,  616 

Sensitivity  analysis,  2,  194,  341-343,  350, 
361 

Separable  function,  186 
Separating  hyperplane,  327 
Sequential linear-quadratic programming
(SLQP),  293,  423,  534 
Sequential  quadratic  programming,  423, 
512, 523, 529-560 
Byrd-Omojokun  method,  547 
derivation,  530-533 
full  quasi-Newton  Hessian,  536 
identification  of  optimal  active  set,  533 
IQP  vs.  EQP,  533 
KKT  system,  275 
least-squares  multipliers,  539 
line  search  algorithm,  545 
local  algorithm,  532 
Newton-KKT  system,  531 
null-space,  538 
QP  multipliers,  538 
rate  of  convergence,  557-560 
reduced-Hessian  approximation, 
538-540 

relaxation  constraints,  547 
S$\ell_1$QP method, see S$\ell_1$QP method
step  computation,  545 
trust-region  method,  546-549 
warm  start,  545 
Set 

affine,  622 
affine  hull  of,  622 
bounded,  620 
closed,  620 
closure  of,  621 
compact,  621 
interior  of,  621 
open,  620 

relative  interior  of,  622,  623 
Sherman-Morrison-Woodbury formula,
139,  140,  144,  162,  283,  377, 
612-613 
Simplex  method 

as active-set method, 388
basis B, 365
complexity  of,  388-389 
cycling,  381-382 

lexicographic  strategy,  382 
perturbation  strategy,  381-382 
degenerate  steps,  372,  381 
description  of  single  iteration,  366-372 
discovery  of,  355 
dual  simplex,  366,  382-385 
entering  index,  368,  370,  372,  375-378 
finite  termination  of,  368-370 
initialization,  378-380 
leaving  index,  368,  370 
linear  algebra  issues,  372-375 
Phase  I/Phase  II,  378-380 
pivoting,  368 

pricing,  368,  370,  375-376 
multiple,  376 
partial,  376 
reduced  costs,  368 
revised,  366 

steepest-edge  rule,  376-378 
Simulated  annealing,  221 
Singular  values,  255,  604 
Singular-value  decomposition  (SVD),  252, 
269, 303,  603-604 
Slack  variables,  see  also  Linear 

programming,  slack/surplus 
variables,  424,  519 
SNOPT,  536,  592 
Software 
BQPD,  490 
CPLEX,  490 

for  quadratic  programming,  490 

IPOPT,  183,  592 

KNITRO,  183,  490,  525,  592 

L-BFGS-B,  183 

LANCELOT,  520,  525,  592 

LOQO,  490,  592 

MINOS,  523,  525,  592 

MOSEK,  490 

OOPS,  490 

OOQP,  490 

PENNON,  526 

QPA,  490 

QPOPT,  490 

SNOPT,  592 


TRON,  175 
VE09,  490 
XPRESS-MP,  490 

Sparse  quasi-Newton  method,  185-186, 
190 

SR1  method,  24,  144,  161 
algorithm,  146 

for  constrained  problems,  538,  540 
limited-memory  version,  177,  181,  183 
properties,  147 
safeguarding,  145 
skipping,  145,  160 
Stability,  616-617 
Starting  point,  18 

Stationary  point,  15,  28,  289,  436,  505 
Steepest  descent  direction,  20,  21,  71,  74 
Steepest  descent  method,  21,  25-27,  31, 
73,  95,  585 

rate  of  convergence,  42,  44,  620 
Step  length,  19,  30 
unit,  23,  29 

Step  length  selection,  see  also  Line  search, 
56-62 

bracketing  phase,  57 
cubic  interpolation,  59 
for  Wolfe  conditions,  60 
initial  step  length,  59 
interpolation  in,  57 
selection  phase,  57 
Stochastic  optimization,  7 
Stochastic  simulation,  221 
Strict  complementarity,  see 
Complementarity  condition,  strict 
Subgradient,  17 
Subspace,  602 
basis,  430,  603 
orthonormal,  432 
dimension,  603 
spanning  set,  603 
Sufficient  reduction,  71,  73,  79 
Sum  of  absolute  values,  249 
Sum  of  squares,  see  Least-squares 
problems,  nonlinear 
Symbolic  differentiation,  194 
Symmetric  indefinite  factorization,  455, 
570,  610-612 
Bunch-Kaufman,  612 
Bunch-Parlett,  611 
modified,  54-56,  63 
sparse,  612 

Symmetric  rank-one  update,  see  SR1 
method 

Tangent,  315-325 
Tangent  cone,  319,  340-341 
Taylor  series,  15,  22,  28,  29,  67,  274,  309, 
330,  332,  334,  502 

Taylor’s  theorem,  15,  21-23,  80,  123,  138, 
167,  193-195,  197,  198,  202,  274, 
280,  294,  323,  325,  332,  334,  341, 
630 

statement  of,  14 
Tensor  methods,  274 
derivation,  283-284 
Termination  criterion,  92 
Triangular  substitution,  433,  606,  609,  617 
Truncated  Newton  method,  see  Newton’s 
method,  Newton-CG,  line-search 
Trust  region 

boundary,  69,  75,  95,  171-173 
box-shaped,  19,  293 
choice  of  size  for,  67,  81 
elliptical,  19,  67,  95,  96,  100 
radius,  20,  26,  68,  69,  73,  258,  294 
spherical,  95,  258 

Trust-region  method,  19-20,  69,  77,  79, 
80,  82,  87,  91,  247,  258,  633 
contrast  with  line  search  method,  20, 
66-67 

dogleg  method,  71,  73-77,  79,  84,  91, 
95,  99,  173,  291-293,  548 
double-dogleg  method,  99 
for  derivative-free  optimization,  225 
for  nonlinear  equations,  271,  273,  285, 
290-296 

global  convergence  of,  292-293 
local  convergence  of,  293-296 
global  convergence,  71,  73,  76-92,  172 


local  convergence,  92-95 
Newton  variant,  26,  68,  92 
software,  98 

Steihaug’s  approach,  77,  170-173, 
489 

strategy  for  adjusting  radius,  69 
subproblem,  19,  25-26,  68,  69,  72,  73, 
76,  77,  91,  95-97,  258 
approximate  solution  of,  68,  71 
exact  solution  of,  71,  77,  79,  83-92 
hard  case,  87-88 

nearly  exact  solution  of,  95,  292-293 
two-dimensional  subspace 
minimization,  71,  76-77,  79,  84, 
95,  98,  100 

Unconstrained  optimization,  6,  352,  427, 
432,  499,  501 
of  barrier  function,  584 

Unit  ball,  91 

Unit  roundoff,  see  Floating-point 
arithmetic,  unit  roundoff 

Variable  metric  method,  see  Quasi-Newton 
method 

Variable  storage  method,  see  Limited 
memory  method 

VE09,  490 

Watchdog  technique,  444-446 

Weakly  active  constraints,  342 

Wolfe  conditions,  33-36,  48,  78,  131,  137, 
138,  140-143,  146,  160,  179,  255, 
287,  290 

scale  invariance  of,  36 
strong,  34,  35,  122,  125,  126,  128,  131, 
138,  142,  162,  179 

XPRESS-MP,  490 

Zoutendijk  condition,  38-41,  128,  156, 
287 


